Technical Experience/Knowledge Needed:
- Cloud-hosted services environment.
- Proven ability to work in a Cloud-based environment.
- Ability to manage and maintain Cloud Infrastructure on AWS
- Must have strong experience in technologies such as Docker, Kubernetes, Functions, etc.
- Knowledge of orchestration tools such as Ansible
- Experience with ELK Stack
- Strong knowledge in Micro Services, Container-based architecture and the corresponding deployment tools and techniques.
- Hands-on knowledge of implementing multi-stage CI/CD with tools like Jenkins and Git.
- Sound knowledge of tools such as Kibana, Kafka, Grafana, and Instana.
- Proficient in Bash scripting.
- Must have in-depth knowledge of Clustering, Load Balancing, High Availability and Disaster Recovery, Auto Scaling, etc.
- AWS Certified Solutions Architect and/or Linux System Administrator
- Strong ability to work independently on complex issues
- Collaborate efficiently with internal experts to resolve customer issues quickly
- No objection to working night shifts, as the production support team works 24x7. Rotational shifts will be assigned weekly so that every candidate gets an equal share of day and night shifts. If you find candidates who are willing to work the night shift only on a need basis, discuss with us.
- Early Joining
- Willingness to work in Delhi NCR

About Opoyi Inc
Senior MLOps Engineer
LLM Operations, Observability & Eval Infrastructure
📍 Mumbai (On-site) | Full-time | 5-7 years
About the Role:
Unico Connect is an AI-first technology partner that builds custom mobile, web, and AI products for clients across multiple geographies.
We are hiring a Senior MLOps Engineer for a dedicated client engagement focused on building an AI-powered application builder platform. The platform consumes LLMs at scale through provider APIs.
This role owns the operational discipline around production LLM consumption - increasingly called LLMOps - covering observability, evaluation infrastructure, model lifecycle, cost operations, prompt deployment, and agent run reliability.
The mandatory requirement is hands-on production experience operating LLM-backed systems, with a strong DevOps or SRE foundation. This is not a model training or ML science role.
The work is making the system around the AI engineer's designs observable, controlled, reliable, and economically accountable. You will pair daily with the Senior AI Engineer, who designs prompts, evals, and agent behaviour - you operationalise those systems for production.
A typical week includes a tracing audit on a degraded agent run, an eval pipeline build for a new model release, a cost attribution review, and a staged prompt rollout.
Responsibilities:
Observability and Tracing
Build and own end-to-end tracing for agent runs: every prompt, response, tool call, token count, latency, and cost, linked to user session and project.
Stand up and operate LLM observability tooling (Langfuse, LangSmith, Braintrust, or Arize Phoenix).
Make debugging a single bad agent run among thousands a routine workflow through searchable traces, failure taxonomies, and dashboards segmented by task type.
Evaluation Infrastructure as a Production System
Operationalise the eval suite designed by the Senior AI Engineer: automated execution in CI on every prompt or model change, with results stored and trended over time.
Implement regression gates that block quality-degrading changes from shipping.
Build production sampling to continuously score a sample of real agent runs and catch quality drift that offline evals miss.
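As a rough illustration of a regression gate, the sketch below compares a candidate eval run against a stored baseline and fails when any metric drops by more than a tolerance. The names `GateResult`, `regression_gate`, the metric names, and the 0.02 default tolerance are assumptions made for this example; they are not taken from the role description or from any specific eval tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    regressions: dict  # metric name -> score drop relative to the baseline


def regression_gate(baseline: dict, candidate: dict, max_drop: float = 0.02) -> GateResult:
    """Fail when any baseline metric is missing or drops by more than max_drop."""
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            # A metric that was not produced is treated as a full regression.
            regressions[metric] = base_score
            continue
        drop = base_score - new_score
        if drop > max_drop:
            regressions[metric] = round(drop, 4)
    return GateResult(passed=not regressions, regressions=regressions)


if __name__ == "__main__":
    # Hypothetical scores; a CI job could exit non-zero when passed is False.
    result = regression_gate(
        {"task_success": 0.91, "format_valid": 0.99},
        {"task_success": 0.86, "format_valid": 0.99},
    )
    print(result)
```

In a CI pipeline, the gate's `passed` flag would typically decide the job's exit code, so a prompt or model change that lowers quality cannot merge unnoticed.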
Model Lifecycle Management
Pin model versions, never "latest".
Own the upgrade process: run the eval suite against new model releases and manage eval-gated migrations.
Maintain fallback chains across providers for graceful degradation or queueing during outages.
Track provider deprecation schedules and plan migrations ahead of forced cutoffs.
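A minimal sketch of version pinning combined with a fallback chain is shown below. The provider names, model identifiers, `ProviderUnavailable`, and the `call` signature are all hypothetical; a real integration would wrap each provider's own client and its specific error types.

```python
class ProviderUnavailable(Exception):
    """Raised by a provider callable when it cannot serve the request."""


# Hypothetical pinned identifiers: each entry names an exact version,
# never a floating alias such as "latest".
FALLBACK_CHAIN = (
    ("provider-a", "model-a-2025-01-15"),
    ("provider-b", "model-b-v3.2"),
)


def complete_with_fallback(prompt, chain, call):
    """Try each (provider, model) pair in order and return the first success.

    `call(provider, model, prompt)` is an assumed adapter supplied by the caller.
    """
    errors = []
    for provider, model in chain:
        try:
            return provider, call(provider, model, prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{provider}/{model}: {exc}")
    # Every provider failed: surface all errors so the run can be queued or retried.
    raise ProviderUnavailable("; ".join(errors))
```

Returning the provider alongside the response keeps traces and cost attribution accurate when a fallback served the request.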
Cost Operations
Implement per-user and per-task cost attribution - token spend is the platform's largest variable cost and requires the same rigour as cloud cost management.
Set up budget alerts and anomaly detection so a single user or bug cannot burn significant spend overnight.
Monitor prompt cache hit rates and quantify savings.
Manage capacity planning around provider rate limits, including quota negotiation and throughput tiering.
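Per-user and per-task cost attribution with a simple budget check could look roughly like the sketch below. The price table, event fields (`user_id`, `task_type`, `model`, `input_tokens`, `output_tokens`), and function names are assumptions for illustration only; real prices and usage records come from the provider and the tracing pipeline.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices in USD per 1,000 tokens, keyed by pinned model name.
PRICE_PER_1K = {
    "model-a": {"input": Decimal("0.003"), "output": Decimal("0.015")},
}


def attribute_costs(events):
    """Sum cost per (user_id, task_type) from a list of usage events."""
    totals = defaultdict(Decimal)
    for event in events:
        price = PRICE_PER_1K[event["model"]]
        cost = (
            price["input"] * event["input_tokens"]
            + price["output"] * event["output_tokens"]
        ) / 1000
        totals[(event["user_id"], event["task_type"])] += cost
    return dict(totals)


def users_over_budget(totals, daily_budget):
    """Return the sorted user ids whose combined spend exceeds daily_budget."""
    per_user = defaultdict(Decimal)
    for (user_id, _task_type), cost in totals.items():
        per_user[user_id] += cost
    return sorted(user for user, cost in per_user.items() if cost > daily_budget)
```

`Decimal` is used instead of floats so that small per-call amounts add up without rounding drift.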
Prompt and Configuration Deployment
Treat prompts as production artifacts: version control for prompts and agent configurations, staged rollout infrastructure (deploy a prompt change to a percentage of traffic before full rollout), A/B testing infrastructure, instant rollback, and audit history covering which prompt version served which user and when.
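One common way to send a prompt change to a fixed percentage of traffic is deterministic hash bucketing, sketched below. The function names, the experiment key, and the version labels are hypothetical; the posting does not prescribe a particular mechanism.

```python
import hashlib


def rollout_bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def select_prompt_version(user_id: str, experiment: str,
                          stable: str, candidate: str, percent: int) -> str:
    """Serve the candidate prompt to roughly `percent` percent of users."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if rollout_bucket(user_id, experiment) < percent:
        return candidate
    return stable
```

Because the bucket depends only on the user and the experiment key, a user who sees the candidate at 10% keeps seeing it as the rollout widens, and setting `percent` to 0 acts as an instant rollback. Logging the returned version with each request provides the audit trail of which prompt served which user.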
Reliability Engineering for Agent Runs
Agent runs are long, stateful, and failure-prone.
Own retry and resume semantics so a run that fails mid-way does not restart from scratch.
Implement timeouts and circuit breakers on provider calls, dead-letter handling for failed runs, and queue and concurrency management for agent workloads.
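A circuit breaker around provider calls can be sketched as below. This is a simplified, single-threaded illustration; the class name, thresholds, and the injectable `clock` parameter are assumptions, and a production version would also need locking and per-provider state.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("provider circuit is open")
            # Cool-down elapsed: half-open, so one trial call is allowed through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Rejecting calls quickly while a provider is down keeps queued agent runs from piling up behind timeouts, and pairs naturally with dead-letter handling for runs that cannot proceed.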
SLO Ownership and Incident Response
Define and track SLOs for agent run latency and completion rates.
Lead incident response when SLOs are breached.
Write postmortems.
Surface reliability risks proactively before they reach users.
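For a completion-rate SLO, the remaining error budget can be computed along the lines of the sketch below. The function name and the example numbers are assumptions for illustration; the posting does not state specific SLO targets.

```python
def error_budget_remaining(total_runs: int, failed_runs: int, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative when overspent).

    slo_target is the required completion rate, for example 0.99.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total_runs < 0 or not 0 <= failed_runs <= total_runs:
        raise ValueError("run counts are inconsistent")
    allowed_failures = total_runs * (1.0 - slo_target)
    if allowed_failures == 0:
        # No budget at all: any failure is an overspend.
        return 1.0 if failed_runs == 0 else float("-inf")
    return 1.0 - failed_runs / allowed_failures
```

With 10,000 runs, a 99% target, and 25 failures, about three quarters of the budget remains; a negative value signals that the SLO is breached for the window and incident response applies.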
Safety and Compliance Operations
Run the moderation pipeline (prompt and output classification) in production.
Monitor for abuse patterns and own incident response when the agent misbehaves at scale.
Maintain audit logs and implement data retention and residency policies for prompts and generated code as enterprise requirements emerge.
AI-Assisted Engineering Discipline
Use Claude, Cursor, and similar tools day to day for infrastructure code, scripts, and pipelines.
Set the team standard for safe use, review, and validation of AI-generated infrastructure before it ships.
Requirements:
Hands-on production ownership of LLM-backed systems in operation (mandatory).
Must have personally shipped and operated at least one LLM-powered system in production, with operational responsibility including on-call, incident response, and reliability ownership.
Alternatively: strong DevOps or SRE background with demonstrated hands-on familiarity with LLMOps tooling (Langfuse, LangSmith, Braintrust, Arize, or equivalent).
POCs and lab work do not qualify.
5+ years of overall engineering experience
At least 2 of those years must be in DevOps, SRE, platform engineering, or LLM operations roles.
This is not an ML science role.
A DevOps or SRE background with a substantive pivot into LLMOps is a strong qualification.
Observability and Tracing Depth
Production experience with LLM observability tooling - Langfuse, LangSmith, Braintrust, or Arize Phoenix.
Comfortable instrumenting with OpenTelemetry, Prometheus, and Grafana.
Able to build and search trace pipelines, define failure taxonomies, and surface quality signals from production traffic.
CI/CD and Quality Gate Experience
Strong with GitHub Actions or GitLab CI.
Experience building automated quality gates: eval-gated pipelines, regression enforcement, or coverage gates that block degrading changes from shipping.
Cost Management and Attribution for Usage-Based Services
Experience owning cost attribution for cloud API spend or equivalent.
Comfortable with budget alerts, anomaly detection, and per-user or per-task cost breakdowns.
Reliability Engineering for Long-Running, Stateful Workloads
Experience with queues, retry patterns, idempotency, and failure recovery on asynchronous or multi-step workloads.
Comfortable defining SLOs and being accountable for them on production systems.
Multi-Provider API Management
Familiarity with LLM provider rate limits, version pinning, fallback chains, and quota management across OpenAI, Anthropic, Google, or equivalent.
Infrastructure as Code and Deployment Automation
Hands-on with Terraform or Pulumi and Docker.
AWS working knowledge (EC2, S3, IAM, EKS or ECS).
Strong with CI/CD for deploying services and configuration changes safely.
Nice to Have
- Experience with prompt A/B testing or staged rollout infrastructure
- Workflow orchestration (BullMQ, Temporal, Celery)
- Content moderation pipeline experience
- Data residency and compliance requirements for AI systems
- Kubernetes (EKS) in production
- AWS certifications
Please Apply - https://zrec.in/7EYKe?source=CareerSite
About Us
Infra360 Solutions is a services company specializing in Cloud, DevSecOps, Security, and Observability solutions. We help technology companies adopt DevOps culture in their organization by focusing on a long-term DevOps roadmap. We identify the technical and cultural issues that arise on the way to implementing DevOps practices successfully, and work with the respective teams to fix them and increase overall productivity. We also run training sessions for developers to help them realize the importance of DevOps. We provide the following services: DevOps, DevSecOps, FinOps, Cost Optimizations, CI/CD, Observability, Cloud Security, Containerization, Cloud Migration, Site Reliability, Performance Optimizations, SIEM and SecOps, Serverless automation, Well-Architected Review, MLOps, Governance, Risk & Compliance. We assess technology architecture, security, governance, compliance, and DevOps maturity for any technology company and help them optimize their cloud cost, streamline their technology architecture, and set up processes to improve the availability and reliability of their website and applications. We set up tools for monitoring, logging, and observability. We focus on bringing DevOps culture to the organization to improve its efficiency and delivery.
Job Description
Job Title: Senior DevOps Engineer / SRE
Department: Technology
Location: Gurgaon
Work Mode: On-site
Working Hours: 10 AM - 7 PM
Terms: Permanent
Experience: 4-6 years
Education: B.Tech/MCA
Notice Period: Immediate
About Us
At Infra360.io, we are a next-generation cloud consulting and services company committed to delivering comprehensive, 360-degree solutions for cloud, infrastructure, DevOps, and security. We partner with clients to transform and optimize their technology landscape, ensuring resilience, scalability, cost efficiency and innovation.
Our core services include Cloud Strategy, Site Reliability Engineering (SRE), DevOps, Cloud Security Posture Management (CSPM), and related Managed Services. We specialize in driving operational excellence across multi-cloud environments, helping businesses achieve their goals with agility and reliability.
We thrive on ownership, collaboration, problem-solving, and excellence, fostering an environment where innovation and continuous learning are at the forefront. Join us as we expand and redefine what’s possible in cloud technology and infrastructure.
Role Summary
We are seeking a Senior DevOps Engineer (SRE) to manage and optimize large-scale, mission-critical production systems. The ideal candidate will have a strong problem-solving mindset, extensive experience in troubleshooting, and expertise in scaling, automating, and enhancing system reliability. This role requires hands-on proficiency in tools like Kubernetes, Terraform, CI/CD, and cloud platforms (AWS, GCP, Azure), along with scripting skills in Python or Go. The candidate will drive observability and monitoring initiatives using tools like Prometheus, Grafana, and APM solutions (Datadog, New Relic, OpenTelemetry).
Strong communication, incident management skills, and a collaborative approach are essential. Experience in team leadership and multi-client engagement is a plus.
Ideal Candidate Profile
- 4-6 years of solid experience as an SRE or DevOps engineer, with a proven track record of handling large-scale production environments
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- Strong hands-on experience managing large-scale production systems
- Strong production troubleshooting skills and the ability to handle high-pressure situations
- Strong experience with databases and data platforms (PostgreSQL, MongoDB, Elasticsearch, Kafka)
- Worked on making production systems more Scalable, Highly Available and Fault-tolerant
- Hands-on experience with ELK or other logging and observability tools
- Hands-on experience with Prometheus, Grafana, and Alertmanager, and with on-call tools such as PagerDuty
- Problem-Solving Mindset
- Strong skills in K8s, Terraform, Helm, Argo CD, and AWS/GCP/Azure
- Good with Python/Go scripting and automation
- Strong with fundamentals like DNS, Networking, Linux
- Experience with APM tools such as New Relic, Datadog, and OpenTelemetry
- Good experience with incident response, incident management, and writing detailed RCAs
- Experience with application best practices for making apps more reliable and fault-tolerant
- Strong leadership skills and the ability to mentor team members and provide guidance on best practices.
- Able to manage multiple clients and take ownership of client issues.
- Experience with Git and coding best practices
Good to have
- Team-leading Experience
- Multiple Client Handling
- Requirements gathering from clients
- Good Communication
Key Responsibilities
- Design and Development:
  - Architect, design, and develop high-quality, scalable, and secure cloud-based software solutions.
  - Collaborate with product and engineering teams to translate business requirements into technical specifications.
  - Write clean, maintainable, and efficient code, following best practices and coding standards.
- Cloud Infrastructure:
  - Develop and optimise cloud-native applications, leveraging cloud services like AWS, Azure, or Google Cloud Platform (GCP).
  - Implement and manage CI/CD pipelines for automated deployment and testing.
  - Ensure the security, reliability, and performance of cloud infrastructure.
- Technical Leadership:
  - Mentor and guide junior engineers, providing technical leadership and fostering a collaborative team environment.
  - Participate in code reviews, ensuring adherence to best practices and high-quality code delivery.
  - Lead technical discussions and contribute to architectural decisions.
- Problem Solving and Troubleshooting:
  - Identify, diagnose, and resolve complex software and infrastructure issues.
  - Perform root cause analysis for production incidents and implement preventative measures.
- Continuous Improvement:
  - Stay up-to-date with the latest industry trends, tools, and technologies in cloud computing and software engineering.
  - Contribute to the continuous improvement of development processes, tools, and methodologies.
  - Drive innovation by experimenting with new technologies and solutions to enhance the platform.
- Collaboration:
  - Work closely with DevOps, QA, and other teams to ensure smooth integration and delivery of software releases.
  - Communicate effectively with stakeholders, including technical and non-technical team members.
- Client Interaction & Management:
  - Serve as a direct point of contact for multiple clients.
  - Handle the unique technical needs and challenges of two or more clients concurrently.
  - Combine direct interaction with clients and internal team coordination.
- Production Systems Management:
  - Bring extensive experience in managing, monitoring, and debugging production environments.
  - Troubleshoot complex issues and ensure that production systems run smoothly with minimal downtime.
- Bachelor of Computer Science or Equivalent Education
- At least 5 years of experience in a relevant technical position.
- Azure and/or AWS experience
- Strong in CI/CD concepts and technologies like GitOps (Argo CD)
- Hands-on experience with DevOps Tools (Jenkins, GitHub, SonarQube, Checkmarx)
- Experience with Helm Charts for package management
- Strong in Kubernetes, OpenShift, and Container Network Interface (CNI)
- Experience with programming languages, runtimes, and frameworks (Spring Boot, Node.js, Python)
- Strong container image management experience using Docker and distroless concepts
- Familiarity with Shared Libraries for code reuse and modularity
- Excellent communication skills (verbal, written, and presentation)
Note: Looking for immediate joiners only.
Candidates must have a minimum of 8 years of IT experience.
Time zone: IST.
Candidates should have hands-on experience in DevOps (GitLab, Artifactory, SonarQube, AquaSec, Terraform, and Docker/K8s).
Thanks & Regards,
Anitha. K
TAG Specialist
Role – DevOps
Experience 3 – 6 Years
Roles & Responsibilities –
- 3-6 years of experience in deploying and managing highly scalable, fault-resilient systems
- Strong experience in container orchestration and server automation tools such as Kubernetes, Google Container Engine, Docker Swarm, Ansible, Terraform
- Strong experience with Linux-based infrastructures, Linux/Unix administration, AWS, Google Cloud, Azure
- Strong experience with databases and data platforms such as MySQL, Hadoop, Elasticsearch, Redis, Cassandra, and MongoDB.
- Knowledge of programming and scripting languages such as Java, JavaScript, Python, PHP, Groovy, Bash.
- Experience in configuring CI/CD pipelines using Jenkins, GitLab CI, Travis.
- Proficient in technologies such as Docker, Kafka, Raft and Vagrant
- Experience in implementing queueing services such as RabbitMQ, Beanstalkd, and Amazon SQS; knowledge of the Elastic Stack is a plus.
Implementing various development, testing, automation tools, and IT infrastructure
Planning the team structure, activities, and involvement in project management activities.
Managing stakeholders and external interfaces
Setting up tools and required infrastructure
Defining and setting development, test, release, update, and support processes for DevOps operation
Reviewing, verifying, and validating the software code developed in the project
Troubleshooting and fixing code bugs
Monitoring processes across the entire lifecycle for adherence, and updating or creating processes to improve them and minimize waste
Encouraging and building automated processes wherever possible
Identifying and deploying cybersecurity measures by continuously performing vulnerability assessment and risk management
Incident management and root cause analysis
Coordination and communication within the team and with customers
Selecting and deploying appropriate CI/CD tools
Striving for continuous improvement and building a continuous integration, continuous delivery, and continuous deployment pipeline (CI/CD pipeline)
Mentoring and guiding the team members
Monitoring and measuring customer experience and KPIs
Managing periodic reporting on the progress to the management and the customer
We have an excellent job opportunity for the position of AWS Infra Architect with a reputed multinational company in Hyderabad.
Mandatory Skills: please find the expectations below
- At least 3 years of experience as an Architect in AWS (primary skill)
- Designing, planning, and implementing solutions, including architecture design
- Automation using Terraform, PowerShell, or Python
- Good experience with CloudFormation templates
- Experience with CloudWatch
- Security in AWS
- Strong Linux Administration skills
MTX Group Inc. is seeking a motivated DevOps Engineer to join our team. MTX Group Inc is a global cloud implementation partner that enables organizations to become a fit enterprise through digital transformation and strategy. MTX is powered by the Maverick.io Artificial Intelligence platform and has a strong presence in the Public Sector providing proprietary designs and innovative concept accelerators around licensing and permitting, inspections, grants management, case management, and program management. MTX is a strategic partner with Salesforce with specialty expertise in Einstein Analytics, Mulesoft, Customer Community, Commerce Cloud, and Marketing Cloud. MTX is a Google Cloud partner helping accelerate digital transformation programs across federal, state, and local government agencies.
The DevOps role is responsible for maintaining infrastructure and both development and operational deployments in multiple cloud environments for MTX Group, Inc. and its clients. This role adheres to and promotes MTX Group, Inc.'s values by performing its duties in a manner that supports and contributes to the achievement of the company's goals.
Responsibilities:
- Develop and manage tools and services to be used by the organization and by external users of the platform
- Automate all operational and repetitive tasks to improve efficiency and productivity of all development teams
- Research and propose new solutions to improve the mavQ platform in terms of speed, scalability, and security
- Automate and manage the organization's cloud infrastructure, which is distributed across the globe and across multiple cloud providers such as Google Cloud and AWS
- Ensure thorough logging, monitoring and alerting for all services and code running in the organization
- Work with development teams to define communications and protocols for distributed microservices
- Help development teams debug DevOps-related issues
- Manage CI/CD, Source Control and IAM for the organization
What you will bring:
- Bachelor’s Degree or equivalent
- 4+ years of experience as a DevOps Engineer OR
- 2+ years of experience as backend developer and 2+ years of experience as DevOps or Systems engineer
- Hands-on experience with Docker and Kubernetes
- Thorough understanding of operating systems and networking
- Theoretical and practical understanding of Infrastructure-as-code and Platform-as-a-service concepts
- Ability to understand and work with any service, tool or API as needed
- Ability to understand implementation of open source products and modify them if necessary
- Ability to visualize large scale distributed systems and debug issues or make changes to said systems
- Understanding and practical experience in managing CI/CD
What we offer:
- A competitive salary on par with top market standards
- Group Medical Insurance (Family Floater Plan - Self + Spouse + 2 Dependent Children)
  - Sum Insured: INR 5,00,000/-
  - Maternity cover for up to two children
  - Inclusive of COVID-19 coverage
  - Cashless and reimbursement facility
  - Access to free online doctor consultation
- Personal Accident Policy (Disability Insurance)
  - Sum Insured: INR 25,00,000/- per employee
  - Accidental Death and Permanent Total Disability are covered up to 100% of Sum Insured
  - Permanent Partial Disability is covered as per the scale of benefits decided by the Insurer
  - Temporary Total Disability is covered
- An option of Paytm Food Wallet (up to Rs. 2500) as a tax saver benefit
- Monthly Internet Reimbursement of up to Rs. 1,000
- Opportunity to pursue Executive Programs/ courses at top universities globally
- Professional Development opportunities through various MTX sponsored certifications on multiple technology stacks including Salesforce, Google Cloud, Amazon & others
***********************
Required Skills and Experience
- 4+ years of relevant experience with DevOps tools such as Jenkins, Ansible, and Chef
- 4+ years of experience in continuous integration/deployment and in developing software tools with Python and shell scripts
- Building and running Docker images and deployment on Amazon ECS
- Working with AWS services (EC2, S3, ELB, VPC, RDS, CloudWatch, ECS, ECR, EKS)
- Knowledge and experience working with container technologies such as Docker and Amazon ECS, EKS, Kubernetes
- Experience with source code and configuration management tools such as Git, Bitbucket, and Maven
- Ability to work with and support Linux environments (Ubuntu, Amazon Linux, CentOS)
- Knowledge and experience in cloud orchestration tools such as AWS CloudFormation and Terraform
- Experience with implementing "infrastructure as code", "pipeline as code" and "security as code" to enable continuous integration and delivery
- Understanding of IAM, RBAC, NACLs, and KMS
- Good communication skills
Good to have:
- Strong understanding of security concepts and methodologies, such as SSH, public key encryption, access credentials, and certificates, and the ability to apply them.
- Knowledge of database administration such as MongoDB.
- Knowledge of maintaining and using tools such as Jira, Bitbucket, Confluence.
- Work with Leads and Architects in designing and implementation of technical infrastructure, platform, and tools to support modern best practices and facilitate the efficiency of our development teams through automation, CI/CD pipelines, and ease of access and performance.
- Establish and promote DevOps thinking, guidelines, best practices, and standards.
- Contribute to architectural discussions, Agile software development process improvement, and DevOps best practices.
