
Senior MLOps Engineer
LLM Operations, Observability & Eval Infrastructure
📍 Mumbai (On-site) | Full-time | 5-7 years
About the Role:
Unico Connect is an AI-first technology partner that builds custom mobile, web, and AI products for clients across multiple geographies.
We are hiring a Senior MLOps Engineer for a dedicated client engagement focused on building an AI-powered application builder platform. The platform consumes LLMs at scale through provider APIs.
This role owns the operational discipline around production LLM consumption - increasingly called LLMOps - covering observability, evaluation infrastructure, model lifecycle, cost operations, prompt deployment, and agent run reliability.
The mandatory requirement is hands-on production experience operating LLM-backed systems, with a strong DevOps or SRE foundation. This is not a model training or ML science role.
The work is to make the systems around the Senior AI Engineer's designs observable, controlled, reliable, and economically accountable. You will pair daily with the Senior AI Engineer, who designs prompts, evals, and agent behaviour; you operationalise those systems for production.
A typical week includes a tracing audit on a degraded agent run, an eval pipeline build for a new model release, a cost attribution review, and a staged prompt rollout.
Responsibilities:
Observability and Tracing
Build and own end-to-end tracing for agent runs: every prompt, response, tool call, token count, latency, and cost, linked to user session and project.
Stand up and operate LLM observability tooling (Langfuse, LangSmith, Braintrust, or Arize Phoenix).
Make debugging a single bad agent run among thousands a routine workflow through searchable traces, failure taxonomies, and dashboards segmented by task type.
Evaluation Infrastructure as a Production System
Operationalise the eval suite designed by the Senior AI Engineer: automated execution in CI on every prompt or model change, with results stored and trended over time.
Implement regression gates that block quality-degrading changes from shipping.
Build production sampling to continuously score a sample of real agent runs and catch quality drift that offline evals miss.
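As an illustration of the regression gates described above, here is a minimal sketch of a check a CI job might run after the eval suite. The metric names, the `max_drop` threshold, and the function name are hypothetical and not taken from the platform itself.

```python
def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,  # hypothetical tolerance; tune per metric in practice
) -> list[str]:
    """Return the metrics on which the candidate regressed beyond max_drop.

    A metric missing from the candidate results is treated as a failure,
    so an eval that silently stopped running cannot pass the gate.
    """
    failures = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None or base_score - new_score > max_drop:
            failures.append(metric)
    return failures


if __name__ == "__main__":
    baseline = {"task_success": 0.90, "tool_call_validity": 0.80}
    candidate = {"task_success": 0.91, "tool_call_validity": 0.70}
    failed = regression_gate(baseline, candidate)
    # A CI step would exit non-zero here to block the change from shipping.
    raise SystemExit(1 if failed else 0)
```

A CI pipeline could store `baseline` from the last accepted run and fail the job whenever the returned list is non-empty.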
Model Lifecycle Management
Pin model versions, never "latest".
Own the upgrade process: run the eval suite against new model releases and manage eval-gated migrations.
Maintain fallback chains across providers for graceful degradation or queueing during outages.
Track provider deprecation schedules and plan migrations ahead of forced cutoffs.
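A fallback chain of the kind mentioned above could be sketched roughly as follows. The provider callables and the `AllProvidersFailed` exception are hypothetical stand-ins; a real implementation would wrap each provider's SDK with a pinned model version and distinguish retryable from non-retryable errors.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order and return (name, response).

    Each callable is assumed to target a pinned model version and to
    raise an exception on outage, timeout, or rate limiting.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a sketch: real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the response lets the tracing layer record which link of the chain actually served the request.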
Cost Operations
Implement per-user and per-task cost attribution - token spend is the platform's largest variable cost and requires the same rigour as cloud cost management.
Set up budget alerts and anomaly detection so a single user or bug cannot burn significant spend overnight.
Monitor prompt cache hit rates and quantify savings.
Manage capacity planning around provider rate limits, including quota negotiation and throughput tiering.
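Per-user and per-task cost attribution with a simple budget check might look like the sketch below. The prices, record fields, and budget figure are invented for illustration; real rates depend on the provider and model.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices per 1,000 tokens; not any provider's actual rates.
PRICE_PER_1K = {"input": Decimal("0.003"), "output": Decimal("0.015")}


def attribute_costs(records: list[dict]) -> dict[tuple[str, str], Decimal]:
    """Sum token spend per (user, task) from usage records."""
    totals: dict[tuple[str, str], Decimal] = defaultdict(Decimal)
    for r in records:
        cost = (
            Decimal(r["input_tokens"]) * PRICE_PER_1K["input"]
            + Decimal(r["output_tokens"]) * PRICE_PER_1K["output"]
        ) / 1000
        totals[(r["user"], r["task"])] += cost
    return dict(totals)


def over_budget(
    totals: dict[tuple[str, str], Decimal], daily_budget: Decimal
) -> list[str]:
    """Return users whose combined spend across tasks exceeds the budget."""
    per_user: dict[str, Decimal] = defaultdict(Decimal)
    for (user, _task), cost in totals.items():
        per_user[user] += cost
    return sorted(u for u, c in per_user.items() if c > daily_budget)
```

`Decimal` is used instead of floats so that small per-request costs add up without rounding drift.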
Prompt and Configuration Deployment
Treat prompts as production artifacts: version control for prompts and agent configurations, staged rollout infrastructure (deploy a prompt change to a percentage of traffic before full rollout), A/B testing infrastructure, instant rollback, and audit history covering which prompt version served which user and when.
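One common way to deploy a prompt change to a percentage of traffic is deterministic hash bucketing, sketched below. The function names, the salt, and the version labels are hypothetical; the posting does not specify how the platform assigns traffic.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def select_prompt_version(
    user_id: str,
    stable: str,
    candidate: str,
    rollout_percent: int,
    salt: str = "prompt-rollout",  # hypothetical per-experiment salt
) -> str:
    """Serve the candidate prompt to roughly rollout_percent of users."""
    return candidate if bucket(user_id, salt) < rollout_percent else stable
```

Because the bucket depends only on the salt and user ID, a user who receives the candidate at 10% keeps it as the rollout widens, and setting `rollout_percent` to 0 acts as an instant rollback. Logging the returned version with each request would supply the audit history of which prompt version served which user.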
Reliability Engineering for Agent Runs
Agent runs are long, stateful, and failure-prone.
Own retry and resume semantics so a run that fails mid-way does not restart from scratch.
Implement timeouts and circuit breakers on provider calls, dead-letter handling for failed runs, and queue and concurrency management for agent workloads.
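A circuit breaker on provider calls can be sketched as below. This is a minimal single-threaded illustration with hypothetical names and thresholds, not a production implementation; the injectable `clock` exists only to make the behaviour testable.

```python
import time
from typing import Any, Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # While open, reject calls until reset_after seconds have passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("provider circuit is open")
            # Otherwise fall through: one trial call is allowed (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A rejected call fails fast instead of waiting on a provider that is already known to be unhealthy, which leaves the run free to be queued or routed to a fallback.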
SLO Ownership and Incident Response
Define and track SLOs for agent run latency and completion rates.
Lead incident response when SLOs are breached.
Write postmortems.
Surface reliability risks proactively before they reach users.
Safety and Compliance Operations
Run the moderation pipeline (prompt and output classification) in production.
Monitor for abuse patterns and own incident response when the agent misbehaves at scale.
Maintain audit logs and implement data retention and residency policies for prompts and generated code as enterprise requirements emerge.
AI-Assisted Engineering Discipline
Use Claude, Cursor, and similar tools day to day for infrastructure code, scripts, and pipelines.
Set the team standard for safe use, review, and validation of AI-generated infrastructure before it ships.
Requirements:
Hands-on production ownership of LLM-backed systems in operation (mandatory).
Must have personally shipped and operated at least one LLM-powered system in production, with operational responsibility including oncall, incident response, and reliability ownership.
Alternatively: strong DevOps or SRE background with demonstrated hands-on familiarity with LLMOps tooling (Langfuse, LangSmith, Braintrust, Arize, or equivalent).
POCs and lab work do not qualify.
5+ years of overall engineering experience
At least 2 of those years in DevOps, SRE, platform engineering, or LLM operations roles.
This is not an ML science role.
A DevOps or SRE background with a substantive pivot into LLMOps is a strong qualification.
Observability and Tracing Depth
Production experience with LLM observability tooling - Langfuse, LangSmith, Braintrust, or Arize Phoenix.
Comfortable instrumenting with OpenTelemetry, Prometheus, and Grafana.
Able to build and search trace pipelines, define failure taxonomies, and surface quality signals from production traffic.
CI/CD and Quality Gate Experience
Strong with GitHub Actions or GitLab CI.
Experience building automated quality gates: eval-gated pipelines, regression enforcement, or coverage gates that block degrading changes from shipping.
Cost Management and Attribution for Usage-Based Services
Experience owning cost attribution for cloud API spend or equivalent.
Comfortable with budget alerts, anomaly detection, and per-user or per-task cost breakdowns.
Reliability Engineering for Long-Running, Stateful Workloads
Experience with queues, retry patterns, idempotency, and failure recovery on asynchronous or multi-step workloads.
Comfortable defining SLOs and being accountable for them on production systems.
Multi-Provider API Management
Familiarity with LLM provider rate limits, version pinning, fallback chains, and quota management across OpenAI, Anthropic, Google, or equivalent.
Infrastructure as Code and Deployment Automation
Hands-on with Terraform or Pulumi and Docker.
AWS working knowledge (EC2, S3, IAM, EKS or ECS).
Strong with CI/CD for deploying services and configuration changes safely.
Nice to Have
- Experience with prompt A/B testing or staged rollout infrastructure
- Workflow orchestration (BullMQ, Temporal, Celery)
- Content moderation pipeline experience
- Data residency and compliance requirements for AI systems
- Kubernetes (EKS) in production
- AWS certifications
What you will do
We are looking for an exceptional engineering lead to join our team. You will be responsible for building and owning systems that will have a critical impact on the business and on the experience of our community from day one.
- Build and lead an agile engineering team
- Work closely with the Founder on product development
- Collaborate with the operations team to understand customer pain points and solve interesting problems
- Code, test, ship - manage the entire application cycle
- Build libraries and documentation for future reference
- Research and develop best practices and tools to enable delivery of features
- Set up capabilities to track and report business and user metrics
- Design and improve architecture to ensure scalability
Requirements
- Proven experience at scaling tech companies, preferably in commerce or social networking
- Keen to innovate, open-minded and collaborative
- Able to interpret product needs and suggest appropriate solutions
- Have led a team, also able to code hands-on
- Strong communication skills
- Strong work ethic: responsible, responsive, and detail-oriented.
Technologies we use
Go, Flutter, AWS, Google Cloud
Role & Responsibilities
- The DevOps Engineer will implement and manage DevOps tools and technologies.
- Create and support advanced pipelines using GitLab.
- Create and support advanced container and serverless environments.
- Deploy cloud infrastructure using Terraform and CloudFormation templates.
- Implement deployments to OpenShift Container Platform, Amazon ECS and EKS
- Troubleshoot containerized builds and deployments
- Implement processes and automations for migrating between OpenShift, AKS and EKS
- Implement CI/CD automations.
Required Skillsets
- 3-5 years of cloud-based architecture software engineering experience.
- Deep understanding of Kubernetes and its architecture.
- Mastery of cloud security engineering tools, techniques, and procedures.
- Experience with AWS services such as Amazon S3, EKS, ECS, DynamoDB, AWS Lambda, API Gateway, etc.
- Experience with designing and supporting infrastructure via Infrastructure-as-Code in AWS, via CDK, CloudFormation Templates, Terraform or other toolset.
- Experienced with tools like Jenkins, GitHub, Puppet, or a similar toolset.
- Experienced with monitoring tools like CloudWatch, New Relic, Grafana, Splunk, etc.
- Excellence in verbal and written communication, and in working collaboratively with a variety of colleagues and clients in a remote development environment.
- Proven track record in cloud computing systems and enterprise architecture and security
Looking for a GCP DevOps Engineer who can join immediately or within 15 days
Job Summary & Responsibilities:
Job Overview:
You will work in engineering and development teams to integrate and develop cloud solutions and virtualized deployments of software-as-a-service products. This requires understanding the software system architecture as well as its performance and security requirements. The DevOps Engineer is also expected to have expertise in available cloud solutions and services, administration of virtual machine clusters, performance tuning and configuration of cloud computing resources, security configuration, and scripting and automation of monitoring functions. The position involves deploying and managing multiple virtual clusters and working with compliance organizations to support security audits. Designing and selecting cloud computing solutions that are reliable, robust, extensible, and easy to migrate is also important.
Experience:
Experience working on billing and budgets for a GCP project - MUST
Experience working on optimizations on GCP based on vendor recommendations - NICE TO HAVE
Experience in implementing the recommendations on GCP
Architect Certifications on GCP - MUST
Excellent communication skills (both verbal & written) - MUST
Excellent documentation skills for processes, steps, and instructions - MUST
At least 2 years of experience on GCP.
Basic Qualifications:
● Bachelor’s/Master’s Degree in Engineering OR Equivalent.
● Extensive scripting or programming experience (Shell Script, Python).
● Extensive experience working with CI/CD (e.g. Jenkins).
● Extensive experience working with GCP, Azure, or Cloud Foundry.
● Experience working with databases (PostgreSQL, Elasticsearch).
● Must have a minimum of 2 years of experience with GCP, along with GCP certification.
Benefits:
● Competitive salary.
● Work from anywhere.
● Learning and gaining experience rapidly.
● Reimbursement for basic working set up at home.
● Insurance (including top-up insurance for COVID).
Location:
Remote - work from anywhere.
Srijan Technologies is hiring for the DevOps Lead position - Cloud Team, with a permanent WFH option.
Immediate joiners or candidates with a 30-day notice period are preferred.
Requirements:
- 4-6 years of experience in DevOps release engineering.
- Expert-level knowledge of Git.
- Must have a strong command of Kubernetes
- Certified Kubernetes Administrator
- Expert-level knowledge of Shell Scripting & Jenkins so as to maintain continuous integration/deployment infrastructure.
- Expert-level knowledge of Docker.
- Expert-level knowledge of a configuration management and provisioning toolchain: at least one of Ansible / Chef / Puppet.
- Basic web development experience and setup: Apache, Nginx, MySQL
- Basic familiarity with the Agile/Scrum process and JIRA.
- Expert-level knowledge of AWS cloud services.
DevOps Engineer Position - 3+ years
Kubernetes, Helm - 3+ years (dev & administration)
Monitoring platform setup experience - Prometheus, Grafana
Azure/ AWS/ GCP Cloud experience - 1+ years.
Ansible/Terraform/Puppet - 1+ years
CI/CD - 3+ years
Technical Experience/Knowledge Needed :
- Cloud-hosted services environment.
- Proven ability to work in a Cloud-based environment.
- Ability to manage and maintain Cloud Infrastructure on AWS
- Must have strong experience in technologies such as Docker, Kubernetes, Functions, etc.
- Knowledge of orchestration tools such as Ansible
- Experience with ELK Stack
- Strong knowledge in Micro Services, Container-based architecture and the corresponding deployment tools and techniques.
- Hands-on knowledge of implementing multi-staged CI / CD with tools like Jenkins and Git.
- Sound knowledge of tools like Kibana, Kafka, Grafana, Instana, and so on.
- Proficient in Bash scripting.
- Must have in-depth knowledge of Clustering, Load Balancing, High Availability and Disaster Recovery, Auto Scaling, etc.
- AWS Certified Solutions Architect and/or Linux System Administrator
- Strong ability to work independently on complex issues
- Collaborate efficiently with internal experts to resolve customer issues quickly
- No objection to working night shifts, as the production support team works on a 24x7 basis. Rotational shifts will be assigned weekly so that everyone gets an equal share of day and night shifts. Candidates willing to work night shifts only on a need basis may be discussed with us.
- Early Joining
- Willingness to work in Delhi NCR
We are a growth-oriented, dynamic, multi-national startup, so if you are looking for startup excitement, dynamics, and buzz, you are in the right place. Read on.
FrontM (www.frontm.com) is an edge AI company with a platform that is redefining how businesses and people in remote and isolated environments (maritime, aviation, mining, and so on) collaborate and drive smart decisions.
The successful candidate will lead the back-end architecture, working alongside the VP of Delivery, CTO, and CEO.
The problem you will be working on:
- Take ownership of AWS cloud infrastructure
- Oversee tech ops with hands-on CI/CD and administration
- Develop Node.js, Java, and backend system procedures for stability, scale, and performance
- Understand FrontM platform roadmap and contribute to planning strategic and tactical capabilities
- Integrate APIs and abstractions for complex requirements
Who you are:
- You are an experienced Cloud Architect and back end developer
- You have experience creating backends with AWS serverless Lambdas, EC2, and MongoDB
- You have extensive CI/CD and DevOps experience
- You can take ownership of continuous server uptime, maintenance, stability and performance
- You can lead a team of backend developers and architects
- You are a die-hard problem solver and never-say-no person
- You have 10+ years of experience
- You have a strong command of the English language
- You have the ability to initiate and lead teams working with senior management
Additional benefits
- Generous pay package, flexible for the right candidate
- Career development and growth planning
- Entrepreneurial environment that nurtures and promotes innovation
- Multi-national team with an enjoyable culture
We'd love to talk to you if you find this interesting and would like to join us on our exciting journey.