Pause Your ML Pipelines for Human Review Using AWS Step Functions + Slack

Have you ever wanted to pause an automated workflow to wait for a human decision?

Maybe you need approval before provisioning cloud resources, promoting a machine learning model to production, or charging a customer’s credit card.

In many data science and machine learning workflows, automation gets you 90% of the way — but that critical last step often needs human judgment.

Especially in production environments, model retraining, anomaly overrides, or large data movements require careful human review to avoid expensive mistakes.

In my case, I needed to manually review situations where my system flagged more than 6% of customer data for anomalies — often due to accidental pushes by customers.

Before I implemented a proper workflow, this was handled informally: developers would directly update production databases (!) — risky, error-prone, and unscalable.

To solve this, I built a scalable manual approval system using AWS Step Functions, Slack, Lambda, and SNS: a cloud-native, low-cost architecture that cleanly pauses workflows for human approval without spinning up idle compute.

In this post, I’ll walk you through the full design, the AWS resources involved, and how you can apply it to your own critical workflows.

Let’s get into it.

The Solution

My application is deployed in the AWS ecosystem, so we’ll use AWS Step Functions to build a state machine that:

  1. Executes business logic
  2. Invokes a Lambda with WaitForTaskToken to pause until approval
  3. Sends a Slack message requesting approval (this could also be an email)
  4. Waits for a human to click “Approve” or “Reject”
  5. Resumes automatically from the same point
The Step function flow

Here is a YouTube video showing the demo and the actual application in action:

I have also hosted the live demo app here →
👉 https://v0-manual-review-app-fwtjca.vercel.app
All code is hosted here with the right set of IAM permissions.


Step-by-Step Implementation

  1. Now we will create the Step Function with a manual review flow step. Here is the step function definition:
Step function flow with definition

The flow above generates a dataset, uploads it to AWS S3 and, if a review is required, invokes the Manual Review Lambda. The manual review step is a Task state that invokes the Lambda with WaitForTaskToken, which pauses the execution until it is explicitly resumed. The Lambda reads the token this way:

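def lambda_handler(event, context):

    # Both values are passed in by the Step Functions task that invokes this Lambda
    config = event["Payload"]["config"]
    task_token = event["Payload"]["taskToken"]  # Step Functions auto-generates this

    # Send the review request; the state machine stays paused until the token is returned
    reviewer = ManualReview(config, task_token)
    reviewer.send_notification()

    return config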

This Lambda sends a Slack message that includes the task token so the function knows what execution to resume.
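For the handler above to receive a task token, the task state that invokes it has to ask Step Functions for one. The following is a minimal sketch of such a state, built as a Python dict and printed as Amazon States Language JSON. The function name, the `CheckReviewDecision` next state, the `$.config` input path, and the nested `Payload` key (chosen only so the event matches `event["Payload"]["taskToken"]`) are assumptions for illustration, not the definition used in this project.

```python
import json


def build_manual_review_state(function_name: str, timeout_seconds: int = 86400) -> dict:
    """Return a Task state that invokes a Lambda and then waits for a callback."""
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix tells Step Functions to pause after the
        # invocation until the token is returned via SendTaskSuccess/SendTaskFailure.
        "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
        "Parameters": {
            "FunctionName": function_name,
            "Payload": {
                # Assumed shape: nested "Payload" so the handler can read
                # event["Payload"]["config"] and event["Payload"]["taskToken"].
                "Payload": {
                    "config.$": "$.config",          # assumed location of the config
                    "taskToken.$": "$$.Task.Token",  # token from the context object
                }
            },
        },
        # Without a timeout the state can wait for as long as the execution lives.
        "TimeoutSeconds": timeout_seconds,
        "Next": "CheckReviewDecision",  # hypothetical follow-up state
    }


state = build_manual_review_state("manual-review-lambda")
print(json.dumps(state, indent=2))
```

The `TimeoutSeconds` value here is one day, which lines up with the pre-signed URL expiry used later in the post; any value that suits your review window works.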

2. Before we send out the Slack notification, we need to set up:

  1. an SNS topic that receives review messages from the Lambda
  2. a Slack workflow with a webhook subscribed to the SNS topic, and a confirmed subscription
  3. an HTTPS API Gateway with approval and rejection endpoints
  4. a Lambda function that processes the API Gateway requests: code

I followed the YouTube video here for my setup.

3. Once the above is set up, add the variables to the webhook step of the Slack workflow:

And use the variables with a helpful note in the following step:

The final workflow will look like this:

4. Publish the review notification, with the job parameters, to the SNS topic so that it reaches Slack (you could alternatively use slack-sdk). Here is the code that builds and publishes the message:

import json
import logging

# Method of the SNS wrapper class; self.client is assumed to be a boto3 SNS client
def publish_message(self, bucket_name: str, s3_file: str, subject: str = "Manual Review") -> dict:

    presigned_url = S3.generate_presigned_url(bucket_name, s3_file, expiration=86400)  # 1 day expiration

    # Fall back to the raw S3 key if pre-signing did not return a URL
    message = {
        "approval_link": self.approve_link,
        "rejection_link": self.reject_link,
        "s3_file": presigned_url if presigned_url else s3_file
    }

    logging.info(f"Publishing message to <{self.topic_arn}>, with subject: {subject}, message: {message}")

    # SNS expects the message body as a string, so serialize the dict to JSON
    response = self.client.publish(
        TopicArn=self.topic_arn,
        Message=json.dumps(message),
        Subject=subject
    )

    logging.info(f"Response: {response}")
    return response

The review Lambda calls this through send_notification:

def send_notification(self):

    # The Lambda itself returns right after publishing. The Step Functions task that
    # invoked it stays paused until send_task_success() or send_task_failure() is
    # called with the task token.

    # If you don't want the task to wait indefinitely (or until the state machine's
    # own timeout), set an explicit timeout on the task state.
    self.sns.publish_message(self.s3_bucket_name, self.s3_key)

def lambda_handler(event, context):

    config = event["Payload"]["config"]
    task_token = event["Payload"]["taskToken"]  # Step Functions auto-generates this

    reviewer = ManualReview(config, task_token)
    reviewer.send_notification()
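The approve_link and reject_link in the message have to carry the task token back to the API Gateway endpoints. A minimal sketch of how such links could be built is below. It assumes the token travels as a `taskToken` query parameter on `/approve` and `/reject` routes; the parameter name, the base URL, and the sample token are hypothetical. Task tokens can contain characters such as `+`, `/`, and `=`, so the token is percent-encoded rather than concatenated as-is.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_review_links(api_base_url: str, task_token: str) -> dict:
    """Build approve/reject URLs that carry the task token as a query parameter."""
    # urlencode percent-encodes characters that would otherwise be misread in a URL
    query = urlencode({"taskToken": task_token})
    base = api_base_url.rstrip("/")
    return {
        "approve_link": f"{base}/approve?{query}",
        "reject_link": f"{base}/reject?{query}",
    }


# Hypothetical base URL and token, for illustration only
links = build_review_links("https://example.execute-api.us-east-1.amazonaws.com/", "AAAA+bb/cc==")

# Round-trip check: decoding the query string gives back the original token
recovered = parse_qs(urlsplit(links["approve_link"]).query)["taskToken"][0]
print(links["approve_link"])
print(recovered)
```

Passing the token in the URL keeps the callback Lambda stateless. If the links are long enough to clutter Slack, storing the token server-side and sending a short ID instead is another option.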

5. Once a review notification arrives in Slack, the user can approve or reject it. The step function stays in a wait state until it receives a response; however, the task token is set to expire after 24 hours, so an unanswered request will time out the step function.

The approve and reject links point to different API Gateway paths, so the rawPath of the incoming request tells the Lambda which action the user chose. It can be parsed here: code

action = event.get("rawPath", "").strip("/").lower()  # Extracts 'approve' or 'reject'

The receiving API Gateway + Lambda combo:

  • Parses the Slack payload
  • Extracts taskToken + decision
  • Uses StepFunctions.send_task_success() or send_task_failure()

Example:

match action:
    case "approve":
        output_dict["is_manually_approved"] = True
        response_message = "Approval processed successfully."
    case "reject":
        output_dict["is_manually_rejected"] = True
        response_message = "Rejection processed successfully."
    case _:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Invalid action. Use '/approve' or '/reject' in URL."})
        }

...

sfn_client.send_task_success(
    taskToken=task_token,
    output=output
)
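Putting the fragments above together, here is a rough sketch of a complete callback handler. It assumes an API Gateway HTTP API event where the token arrives under a `taskToken` key in `queryStringParameters` (a hypothetical parameter name), and it takes the Step Functions client as an argument so the logic can be exercised with a stub. In a real Lambda you would pass `boto3.client("stepfunctions")` instead. The output keys mirror the ones shown above, but the exact output shape is an assumption.

```python
import json


def handle_review(event: dict, sfn_client) -> dict:
    """Resume a paused execution based on the /approve or /reject path."""
    action = event.get("rawPath", "").strip("/").lower()
    # queryStringParameters is absent or None when the URL has no query string
    task_token = (event.get("queryStringParameters") or {}).get("taskToken")

    if action not in ("approve", "reject") or not task_token:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "Use '/approve' or '/reject' with a taskToken."}),
        }

    output = {
        "is_manually_approved": action == "approve",
        "is_manually_rejected": action == "reject",
    }
    # The output must be a JSON string; it becomes the result of the paused task
    sfn_client.send_task_success(taskToken=task_token, output=json.dumps(output))
    return {"statusCode": 200, "body": json.dumps({"message": f"{action} processed"})}


class StubSfnClient:
    """Stand-in for the boto3 Step Functions client, used for a local dry run."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(kwargs)


stub = StubSfnClient()
ok = handle_review({"rawPath": "/approve", "queryStringParameters": {"taskToken": "tok-123"}}, stub)
bad = handle_review({"rawPath": "/maybe", "queryStringParameters": None}, stub)
print(ok["statusCode"], bad["statusCode"], stub.calls)
```

Reporting a rejection through send_task_success() with a flag, as this sketch does, lets a later Choice state decide what to do with it. Calling send_task_failure() instead would fail the task directly.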

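Note: A task configured with WaitForTaskToken stays paused until the token is returned. If nothing ever calls send_task_success() or send_task_failure() with that token, your workflow just stalls until a timeout fires.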

Bonus: If you need email or SMS alerts, use SNS to notify a broader group.
Just call sns.publish() from within your Lambda or Step Function.

Testing

Once the manual approval system was wired up, it was time to kick the tires. Here’s how I tested it:

  • Right after publishing the Slack workflow, I confirmed the SNS subscription, since messages are not forwarded until it is confirmed. Don’t skip this step.
  • Then, I triggered the Step Function manually with a fake payload simulating a data flagging event.
  • When the workflow hit the manual approval step, it sent a Slack message with Approve/Reject buttons.
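
The manual trigger in the second bullet can be scripted. Below is a minimal sketch that starts an execution with a fake payload. The payload fields (`flagged_ratio`, `source`), the execution name prefix, and the ARN are all hypothetical, and the client is injected so the example runs without AWS access. With real credentials you would pass `boto3.client("stepfunctions")`.

```python
import json
import uuid


def start_test_execution(sfn_client, state_machine_arn: str, flagged_ratio: float) -> dict:
    """Start an execution with a fake payload that simulates a data flagging event."""
    payload = {"config": {"flagged_ratio": flagged_ratio, "source": "manual-test"}}
    return sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        # A unique name avoids a collision with an earlier run of the same name
        name=f"manual-review-test-{uuid.uuid4().hex[:8]}",
        input=json.dumps(payload),  # the input must be a JSON string
    )


class RecordingClient:
    """Stand-in for the boto3 Step Functions client, used for a local dry run."""

    def __init__(self):
        self.calls = []

    def start_execution(self, **kwargs):
        self.calls.append(kwargs)
        return {"executionArn": f"{kwargs['stateMachineArn']}:{kwargs['name']}"}


client = RecordingClient()
response = start_test_execution(
    client,
    "arn:aws:states:us-east-1:123456789012:stateMachine:manual-review-demo",  # placeholder ARN
    flagged_ratio=0.08,  # above the 6% review threshold mentioned earlier
)
print(response["executionArn"])
```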

I tested all major paths:

  • Approve: Clicked Approve — saw the Step Function resume and complete successfully.
  • Reject: Clicked Reject — Step Function moved cleanly into a failure state.
  • Timeout: Ignored the Slack message — Step Function waited for the configured timeout and then gracefully timed out without hanging.

Behind the scenes, I also verified that:

  • The Lambda receiving Slack responses was correctly parsing action payloads.
  • No rogue task tokens were left hanging.
  • Step Functions metrics and Slack error logs were clean.

I highly recommend testing not just happy paths, but also “what if nobody clicks?” and “what if Slack glitches?” — catching these edge cases early saved me headaches later.


Lessons Learned

  • Always use timeouts: Set a timeout both on the WaitForTaskToken step and on the entire Step Function. Without them, workflows can get stuck indefinitely if no one responds.
  • Pass necessary context: If your Step Function needs certain files, paths, or config settings after resuming, make sure you encode them and send them along in the SNS notification, so they come back with the reviewer’s response.
    The resumed step continues with whatever output is returned alongside the task token, not with any in-memory state from the Lambda that sent the notification.
  • Manage Slack noise: Be careful about spamming a Slack channel with too many review requests. I recommend creating separate channels for development, UAT, and production flows to keep things clean.
  • Lock down permissions early: Make sure all your AWS resources (Lambda functions, API Gateway, S3 buckets, SNS Topics) have correct and minimal permissions following the principle of least privilege. Where I needed to customize beyond AWS’s defaults, I wrote and posted inline IAM policies as JSON. (You’ll find examples in the GitHub repo).
  • Pre-sign and shorten URLs: If you’re sending links (e.g., to S3 files) in Slack messages, pre-sign the URLs for secure access, and shorten them for a cleaner Slack UI. Here’s a quick example I used:
import requests
# Pass the URL via params so its own query string (the signature) is percent-encoded
shorten_url = requests.get("https://tinyurl.com/api-create.php", params={"url": presigned_url}, timeout=10).text
default_links[key] = shorten_url if shorten_url else presigned_url
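
The pre-signing half of that last point could look roughly like the sketch below. It is not the S3 helper from this project. It assumes a boto3 S3 client (whose `generate_presigned_url` method takes `ClientMethod`, `Params`, and `ExpiresIn`) and returns `None` on failure so the caller can fall back to the plain S3 key. The fake client, bucket, and key are hypothetical and exist only so the example runs without AWS credentials.

```python
import logging
from typing import Optional


def presign_s3_url(s3_client, bucket_name: str, s3_key: str, expiration: int = 86400) -> Optional[str]:
    """Return a time-limited download URL for an S3 object, or None on failure."""
    try:
        return s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket_name, "Key": s3_key},
            ExpiresIn=expiration,  # seconds; 86400 is one day
        )
    except Exception:
        # Broad catch keeps the sketch free of a botocore import; narrow it in real code
        logging.exception("Could not pre-sign s3://%s/%s", bucket_name, s3_key)
        return None


class FakeS3Client:
    """Stand-in for boto3.client("s3") that returns a recognizable dummy URL."""

    def generate_presigned_url(self, client_method, Params, ExpiresIn):
        return f"https://{Params['Bucket']}.s3.amazonaws.com/{Params['Key']}?expires={ExpiresIn}"


url = presign_s3_url(FakeS3Client(), "review-bucket", "flagged/batch-01.csv")
missing = presign_s3_url(object(), "review-bucket", "flagged/batch-01.csv")  # no such method
print(url, missing)
```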

Wrapping Up

Adding human-in-the-loop logic doesn’t have to mean duct tape and cron jobs. With Step Functions + Slack, you can build reviewable, traceable, and production-safe approval flows.

If this helped, or you’re trying something similar, drop a note in the comments! Let’s build better workflows. 

Note: All images in this article were created by the author

The post Pause Your ML Pipelines for Human Review Using AWS Step Functions + Slack appeared first on Towards Data Science.