
About Mobigossip Infotech Pvt Ltd
Senior MLOps Engineer
LLM Operations, Observability & Eval Infrastructure
📍 Mumbai (On-site) | Full-time | 5-7 years
About the Role:
Unico Connect is an AI-first technology partner that builds custom mobile, web, and AI products for clients across multiple geographies.
We are hiring a Senior MLOps Engineer for a dedicated client engagement focused on building an AI-powered application builder platform. The platform consumes LLMs at scale through provider APIs.
This role owns the operational discipline around production LLM consumption - increasingly called LLMOps - covering observability, evaluation infrastructure, model lifecycle, cost operations, prompt deployment, and agent run reliability.
The mandatory requirement is hands-on production experience operating LLM-backed systems, with a strong DevOps or SRE foundation. This is not a model training or ML science role.
The work is making the system around the AI engineer's designs observable, controlled, reliable, and economically accountable. You will pair daily with the Senior AI Engineer, who designs prompts, evals, and agent behaviour - you operationalise those systems for production.
A typical week includes a tracing audit on a degraded agent run, an eval pipeline build for a new model release, a cost attribution review, and a staged prompt rollout.
Responsibilities:
Observability and Tracing
Build and own end-to-end tracing for agent runs: every prompt, response, tool call, token count, latency, and cost, linked to user session and project.
Stand up and operate LLM observability tooling (Langfuse, LangSmith, Braintrust, or Arize Phoenix).
Make debugging a single bad agent run among thousands a routine workflow through searchable traces, failure taxonomies, and dashboards segmented by task type.
Evaluation Infrastructure as a Production System
Operationalise the eval suite designed by the Senior AI Engineer: automated execution in CI on every prompt or model change, with results stored and trended over time.
Implement regression gates that block quality-degrading changes from shipping.
Build production sampling to continuously score a sample of real agent runs and catch quality drift that offline evals miss.
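As an illustration of the regression gate described above, here is a minimal sketch, not the platform's actual implementation. It assumes eval scores are "higher is better" values between 0 and 1; the function name `regression_gate`, the `max_drop` tolerance, and the metric names in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    passed: bool
    regressions: dict[str, float]  # metric name -> drop versus baseline


def regression_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per metric in practice
) -> GateResult:
    """Fail when any baseline metric drops by more than max_drop.

    A metric missing from the candidate run is scored as 0.0, so a
    silently skipped eval cannot pass the gate.
    """
    regressions: dict[str, float] = {}
    for name, base_score in baseline.items():
        drop = base_score - candidate.get(name, 0.0)
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return GateResult(passed=not regressions, regressions=regressions)
```

In CI, a job could call `regression_gate` with the stored baseline and the scores from the current prompt or model change, then exit non-zero when `passed` is false so the change cannot merge.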
Model Lifecycle Management
Pin model versions, never "latest".
Own the upgrade process: run the eval suite against new model releases and manage eval-gated migrations.
Maintain fallback chains across providers for graceful degradation or queueing during outages.
Track provider deprecation schedules and plan migrations ahead of forced cutoffs.
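A fallback chain with pinned model versions might look like the sketch below. It is provider-agnostic by design: each entry pairs a pinned model identifier with a callable that wraps a real provider client. The names `complete_with_fallback` and `AllProvidersFailed`, and any model identifiers passed in, are hypothetical placeholders.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


# (pinned model identifier, callable that sends the prompt to that model).
# The callables stand in for real provider SDK calls.
Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(
    prompt: str, chain: Sequence[Provider]
) -> tuple[str, str]:
    """Try providers in order; return (model_id, response) from the first success."""
    errors: list[str] = []
    for model_id, call in chain:
        try:
            return model_id, call(prompt)
        except Exception as exc:  # real code would catch provider-specific errors
            errors.append(f"{model_id}: {exc}")
    # Nothing succeeded: the caller can queue the request or degrade gracefully.
    raise AllProvidersFailed("; ".join(errors))
```

Returning the model identifier alongside the response lets the trace record which pinned version actually served the request.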
Cost Operations
Implement per-user and per-task cost attribution - token spend is the platform's largest variable cost and requires the same rigour as cloud cost management.
Set up budget alerts and anomaly detection so a single user or bug cannot burn significant spend overnight.
Monitor prompt cache hit rates and quantify savings.
Manage capacity planning around provider rate limits, including quota negotiation and throughput tiering.
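Per-user and per-task cost attribution with a simple budget check can be sketched as follows. The prices in `PRICE_PER_MTOK` are made-up illustrative values, not any provider's real rates, and the `Usage` record and function names are assumptions for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    user_id: str
    task_type: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per million tokens; real prices come from the provider.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def cost_usd(u: Usage) -> float:
    """Cost of one usage record at the assumed prices."""
    return (
        u.input_tokens * PRICE_PER_MTOK["input"]
        + u.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def attribute(records: list[Usage]) -> dict[tuple[str, str], float]:
    """Sum spend per (user_id, task_type)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for u in records:
        totals[(u.user_id, u.task_type)] += cost_usd(u)
    return dict(totals)


def over_budget(
    totals: dict[tuple[str, str], float], daily_budget_usd: float
) -> list[str]:
    """Return user IDs whose total spend exceeds the daily budget."""
    per_user: dict[str, float] = defaultdict(float)
    for (user_id, _task), spend in totals.items():
        per_user[user_id] += spend
    return sorted(u for u, spend in per_user.items() if spend > daily_budget_usd)
```

A scheduled job could run `over_budget` over the day's records and page or throttle when the returned list is non-empty; real anomaly detection would compare against each user's historical spend rather than a fixed threshold.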
Prompt and Configuration Deployment
Treat prompts as production artifacts: version control for prompts and agent configurations, staged rollout infrastructure (deploy a prompt change to a percentage of traffic before full rollout), A/B testing infrastructure, instant rollback, and audit history covering which prompt version served which user and when.
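One common way to deploy a prompt change to a percentage of traffic is deterministic hash bucketing, sketched below. This is an illustrative approach rather than a description of the platform's rollout system; the function names and the `experiment` key are assumptions.

```python
import hashlib


def bucket(user_id: str, experiment: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_prompt_version(
    user_id: str,
    experiment: str,
    stable: str,
    candidate: str,
    rollout_percent: int,
) -> str:
    """Serve the candidate prompt version to roughly rollout_percent of users."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return candidate if bucket(user_id, experiment) < rollout_percent else stable
```

Because the bucket depends only on the user and experiment, a user keeps the same version as the percentage grows, setting `rollout_percent` to 0 acts as an instant rollback, and logging the returned version with each request gives the audit trail of which prompt served which user.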
Reliability Engineering for Agent Runs
Agent runs are long, stateful, and failure-prone.
Own retry and resume semantics so a run that fails mid-way does not restart from scratch.
Implement timeouts and circuit breakers on provider calls, dead-letter handling for failed runs, and queue and concurrency management for agent workloads.
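Resume semantics for a multi-step agent run can be sketched with a checkpoint that records completed steps. This is a simplified in-memory illustration: the `run_with_resume` name and the checkpoint layout are assumptions, and a production system would persist the checkpoint to a durable store and require each step to be idempotent.

```python
from typing import Callable

Step = tuple[str, Callable[[dict], dict]]


def run_with_resume(steps: list[Step], checkpoint: dict) -> dict:
    """Run named steps in order, skipping any already recorded in checkpoint.

    checkpoint is a mutable dict with keys "done" (list of step names) and
    "state" (dict passed between steps). It is held in memory here; a real
    system would write it to durable storage after every step.
    """
    done = checkpoint.setdefault("done", [])
    state = checkpoint.setdefault("state", {})
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier attempt
        state = step(state)
        checkpoint["state"] = state
        done.append(name)  # record only after the step succeeds
    return state
```

If a step raises, the exception propagates with the checkpoint left at the last successful step, so calling `run_with_resume` again with the same checkpoint continues from the failed step instead of restarting the run.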
SLO Ownership and Incident Response
Define and track SLOs for agent run latency and completion rates.
Lead incident response when SLOs are breached.
Write postmortems.
Surface reliability risks proactively before they reach users.
Safety and Compliance Operations
Run the moderation pipeline (prompt and output classification) in production.
Monitor for abuse patterns and own incident response when the agent misbehaves at scale.
Maintain audit logs and implement data retention and residency policies for prompts and generated code as enterprise requirements emerge.
AI-Assisted Engineering Discipline
Use Claude, Cursor, and similar tools day to day for infrastructure code, scripts, and pipelines.
Set the team standard for safe use, review, and validation of AI-generated infrastructure before it ships.
Requirements:
Hands-on production ownership of LLM-backed systems in operation (mandatory).
Must have personally shipped and operated at least one LLM-powered system in production, with operational responsibility including oncall, incident response, and reliability ownership.
Alternatively: strong DevOps or SRE background with demonstrated hands-on familiarity with LLMOps tooling (Langfuse, LangSmith, Braintrust, Arize, or equivalent).
POCs and lab work do not qualify.
5+ years of overall engineering experience
With at least 2 years in DevOps, SRE, platform engineering, or LLM operations roles.
This is not an ML science role.
A DevOps or SRE background with a substantive pivot into LLMOps is a strong qualification.
Observability and Tracing Depth
Production experience with LLM observability tooling - Langfuse, LangSmith, Braintrust, or Arize Phoenix.
Comfortable instrumenting with OpenTelemetry, Prometheus, and Grafana.
Able to build and search trace pipelines, define failure taxonomies, and surface quality signals from production traffic.
CI/CD and Quality Gate Experience
Strong with GitHub Actions or GitLab CI.
Experience building automated quality gates: eval-gated pipelines, regression enforcement, or coverage gates that block degrading changes from shipping.
Cost Management and Attribution for Usage-Based Services
Experience owning cost attribution for cloud API spend or equivalent.
Comfortable with budget alerts, anomaly detection, and per-user or per-task cost breakdowns.
Reliability Engineering for Long-Running, Stateful Workloads
Experience with queues, retry patterns, idempotency, and failure recovery on asynchronous or multi-step workloads.
Comfortable defining SLOs and being accountable for them on production systems.
Multi-Provider API Management
Familiarity with LLM provider rate limits, version pinning, fallback chains, and quota management across OpenAI, Anthropic, Google, or equivalent.
Infrastructure as Code and Deployment Automation
Hands-on with Terraform or Pulumi and Docker.
AWS working knowledge (EC2, S3, IAM, EKS or ECS).
Strong with CI/CD for deploying services and configuration changes safely.
Nice to Have
- Experience with prompt A/B testing or staged rollout infrastructure
- Workflow orchestration (BullMQ, Temporal, Celery)
- Content moderation pipeline experience
- Data residency and compliance requirements for AI systems
- Kubernetes (EKS) in production
- AWS certifications
Job Summary:
We are seeking a Principal DevSecOps Engineer/Architect with deep expertise in Terraform, Atlantis, and large-scale multi-cloud environments (200–300+ accounts/subscriptions). This is a client-facing consulting role, responsible for designing and governing secure, scalable infrastructure and CI/CD platforms across enterprise environments.
This role goes beyond implementation: you will define standards, governance models, and automation frameworks for large organizations operating at scale.
Key Responsibilities:
Cloud Platforms / AWS Expertise
- AWS (Expert) – Organizations, IAM, VPC, Security services
- Strong hands-on experience with AWS Service Catalog to design, publish, and manage standardized, secure infrastructure products across multi-account AWS environments
- Proven ability to enforce governance, compliance, and cost controls by integrating Service Catalog with IAM, SCPs, and CI/CD pipelines
- Experience with Azure or GCP (multi-cloud exposure required)
Enterprise DevSecOps Architecture
- Architect and standardize end-to-end DevSecOps platforms
- Design secure, scalable CI/CD pipelines using Jenkins & GitHub Actions
- Embed security gates (SAST, DAST, SCA, container security) into pipelines
- Define reusable pipeline templates across multiple teams/business units
Terraform + Atlantis (Must-Have, Core Focus)
- Design and manage large-scale Terraform architecture across 200–300+ cloud accounts
- Implement Terraform modules, remote backends, and state isolation strategies
- Build and manage Atlantis workflows for automated Terraform plan/apply
- Enforce:
- Code reviews & approvals for infra changes
- Policy-as-code (OPA / Sentinel)
- Drift detection & remediation
Multi-Cloud Architecture (AWS + Azure/GCP)
- Architect multi-cloud landing zones and governance frameworks
- Manage large account structures:
- AWS Organizations (multi-account strategy)
- Azure Management Groups / Subscriptions
- Implement:
- Network segmentation (VPC/VNet design)
- Identity federation (SSO, IAM, RBAC)
- Cross-account access models
- Ensure high availability, scalability, and cost optimization
Security & Compliance at Scale
- Implement DevSecOps controls aligned with SOC 2, PCI-DSS, GDPR
- Integrate tools like:
- JFrog Xray (SCA)
- SonarQube (SAST)
- Trivy / Prisma / Wiz (container & cloud security)
- Build policy-as-code frameworks for compliance enforcement
- Automate evidence collection for audits
Artifact & Dependency Management
- Architect secure artifact lifecycle using JFrog Artifactory
- Implement access control, immutability, and vulnerability scanning
- Standardize dependency management across teams
Observability & Reliability
- Implement centralized logging/monitoring:
- CloudWatch, ELK, Prometheus/Grafana
- Define SLOs/SLIs for platform reliability
- Reduce MTTR via automation and alerting
Consulting & Leadership
- Act as a trusted advisor to enterprise clients
- Lead architecture discussions and DevSecOps transformation programs
- Mentor teams and enforce engineering best practices
- Drive platform adoption across multiple business units
Required Skills:
Infrastructure as Code (Core)
- Terraform (Expert level)
- Atlantis (Hands-on implementation at scale)
- Strong experience with multi-account architecture (200–300+ accounts)
DevOps Tooling
- Jenkins (advanced pipelines)
- GitHub Actions (enterprise workflows)
- GitOps practices
Cloud Platforms
- AWS (Expert) – Organizations, IAM, VPC, Security services
- Experience with Azure or GCP (multi-cloud exposure required)
Security Stack
- SAST, DAST, SCA tools integration
- Container security & Kubernetes security
- Secrets management (Vault / AWS Secrets Manager)
Programming/Scripting
- Python / Bash (automation focus)
Preferred Qualifications:
- Experience with Kubernetes (EKS/AKS/GKE) at scale
- Knowledge of Zero Trust Architecture
- Experience with OPA / Sentinel (policy-as-code)
- Familiarity with platform engineering concepts (Internal Developer Platforms)
Certifications (Good to Have)
- AWS Solutions Architect – Professional
- Terraform Associate / Advanced Terraform certifications
- CISSP / CKS (Kubernetes Security)
What Makes This Role Premium
- Ownership of large-scale (200–300 account) cloud environments
- Direct impact on enterprise DevSecOps maturity
- Client-facing architecture & strategy role (not just execution)
- Opportunity to define organization-wide standards
About the Job
This is a full-time role for a Lead DevOps Engineer at Spark Eighteen. We are seeking an experienced DevOps professional to lead our infrastructure strategy, design resilient systems, and drive continuous improvement in our deployment processes. In this role, you will architect scalable solutions, mentor junior engineers, and ensure the highest standards of reliability and security across our cloud infrastructure. The job location is flexible with preference for the Delhi NCR region.
Responsibilities
- Lead and mentor the DevOps/SRE team
- Define and drive DevOps strategy and roadmaps
- Oversee infrastructure automation and CI/CD at scale
- Collaborate with architects, developers, and QA teams to integrate DevOps practices
- Ensure security, compliance, and high availability of platforms
- Own incident response, postmortems, and root cause analysis
- Budgeting, team hiring, and performance evaluation
Requirements
Technical Skills
- Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- 7+ years of professional DevOps experience with demonstrated progression.
- Strong architecture and leadership background
- Deep hands-on knowledge of infrastructure as code, CI/CD, and cloud
- Proven experience with monitoring, security, and governance
- Effective stakeholder and project management
- Experience with tools like Jenkins, ArgoCD, Terraform, Vault, ELK, etc.
- Strong understanding of business continuity and disaster recovery
Soft Skills
- Cross-functional communication excellence with ability to lead technical discussions.
- Strong mentorship capabilities for junior and mid-level team members.
- Advanced strategic thinking and ability to propose innovative solutions.
- Excellent knowledge transfer skills through documentation and training.
- Ability to understand and align technical solutions with broader business strategy.
- Proactive problem-solving approach with focus on continuous improvement.
- Strong leadership skills in guiding team performance and technical direction.
- Effective collaboration across development, QA, and business teams.
- Ability to make complex technical decisions with minimal supervision.
- Strategic approach to risk management and mitigation.
What We Offer
- Professional Growth: Continuous learning opportunities through diverse projects and mentorship from experienced leaders
- Global Exposure: Work with clients from 20+ countries, gaining insights into different markets and business cultures
- Impactful Work: Contribute to projects that make a real difference, with solutions generating over $1B in revenue
- Work-Life Balance: Flexible arrangements that respect personal wellbeing while fostering productivity
- Career Advancement: Clear progression pathways as you develop skills within our growing organization
- Competitive Compensation: Attractive salary packages that recognize your contributions and expertise
Our Culture
At Spark Eighteen, our culture centers on innovation, excellence, and growth. We believe in:
- Quality-First: Delivering excellence rather than just quick solutions
- True Partnership: Building relationships based on trust and mutual respect
- Communication: Prioritizing clear, effective communication across teams
- Innovation: Encouraging curiosity and creative approaches to problem-solving
- Continuous Learning: Supporting professional development at all levels
- Collaboration: Combining diverse perspectives to achieve shared goals
- Impact: Measuring success by the value we create for clients and users
Apply Here - https://tinyurl.com/t6x23p9b
- Design cloud infrastructure that is secure, scalable, and highly available on AWS, Azure and GCP
- Work collaboratively with software engineering to define infrastructure and deployment requirements
- Provision, configure and maintain AWS, Azure, GCP cloud infrastructure defined as code
- Ensure configuration and compliance with configuration management tools
- Administer and troubleshoot Linux based systems
- Troubleshoot problems across a wide array of services and functional areas
- Build and maintain operational tools for deployment, monitoring, and analysis of AWS, Azure Infrastructure and systems
- Perform infrastructure cost analysis and optimization
Position Overview: We are seeking a talented and experienced Cloud Engineer specialized in AWS cloud services to join our dynamic team. The ideal candidate will have a strong background in AWS infrastructure and services, including EC2, Elastic Load Balancing (ELB), Auto Scaling, S3, VPC, RDS, CloudFormation, CloudFront, Route 53, AWS Certificate Manager (ACM), and Terraform for Infrastructure as Code (IaC). Experience with other AWS services is a plus.
Responsibilities:
• Design, deploy, and maintain AWS infrastructure solutions, ensuring scalability, reliability, and security.
• Configure and manage EC2 instances to meet application requirements.
• Implement and manage Elastic Load Balancers (ELB) to distribute incoming traffic across multiple instances.
• Set up and manage AWS Auto Scaling to dynamically adjust resources based on demand.
• Configure and maintain VPCs, including subnets, route tables, and security groups, to control network traffic.
• Deploy and manage AWS CloudFormation and Terraform templates to automate infrastructure provisioning using Infrastructure as Code (IaC) principles.
• Implement and monitor S3 storage solutions for secure and scalable data storage
• Set up and manage CloudFront distributions for content delivery with low latency and high transfer speeds.
• Configure Route 53 for domain management, DNS routing, and failover configurations.
• Manage AWS Certificate Manager (ACM) for provisioning, managing, and deploying SSL/TLS certificates.
• Collaborate with cross-functional teams to understand business requirements and provide effective cloud solutions.
• Stay updated with the latest AWS technologies and best practices to drive continuous improvement.
Qualifications:
• Bachelor's degree in Computer Science, Information Technology, or a related field.
• Minimum of 2 years of relevant experience in designing, deploying, and managing AWS cloud solutions.
• Strong proficiency in AWS services such as EC2, ELB, Auto Scaling, VPC, S3, RDS, and CloudFormation.
• Experience with other AWS services such as Lambda, ECS, EKS, and DynamoDB is a plus.
• Solid understanding of cloud computing principles, including IaaS, PaaS, and SaaS.
• Excellent problem-solving skills and the ability to troubleshoot complex issues in a cloud environment.
• Strong communication skills with the ability to collaborate effectively with cross-functional teams.
• Relevant AWS certifications (e.g., AWS Certified Solutions Architect, AWS Certified DevOps Engineer, etc.) are highly desirable.
Additional Information:
• We value creativity, innovation, and a proactive approach to problem-solving.
• We offer a collaborative and supportive work environment where your ideas and contributions are valued.
• Opportunities for professional growth and development.
Someshwara Software Pvt Ltd is an equal opportunity employer. We celebrate diversity and are dedicated to creating an inclusive environment for all employees.
Required qualifications and must have skills
- 5+ years of experience managing a team of 5+ infrastructure software engineers
- 5+ years of experience in building and scaling technical infrastructure
- 5+ years of experience in delivering software
- Experience leading by influence in multi-team, cross-functional projects
- Demonstrated experience recruiting and managing technical teams, including performance management and managing engineers
- Experience with cloud service providers such as AWS, GCP, or Azure
- Experience with containerization technologies such as Kubernetes and Docker
Nice to have Skills
- Experience with Hadoop, Hive and Presto
- Application/infrastructure benchmarking and optimization
- Familiarity with modern CI/CD practices
- Familiarity with reliability best practices
DevOps Engineer
KNOLSKAPE is looking for a DevOps Engineer to help us build Educational platforms and products that make learning experiential for leaders of the world.
DevOps Engineer responsibilities include deploying product updates, identifying production issues, and implementing integrations that meet customer needs. If you have a solid background in working with cloud technologies, setting up efficient deployment processes, and are motivated to work with diverse and talented teams, we’d like to meet you.
Ultimately, you will execute and automate operational processes fast, accurately, and securely.
Skills and Experience
- 2+ years of experience building infrastructure with cloud providers (AWS / Azure / GCP)
- Build and deployment management (GitLab).
- Experience writing automation scripts using Shell and Python, and infrastructure automation with Terraform.
- Good experience building YAML-based pipelines, with working knowledge of the GitLab environment.
- System Administration skill set.
- Docker/Kubernetes container infrastructure and orchestration
- Deploying/operating NodeJs/PHP/LAMP framework-based clusters with infrastructure.
- Monitoring, metrics collection, and distributed tracing
- Infrastructure as code: experience with Terraform preferred.
- Strong AWS Deployment Experience
- Provide system-level technical support
- Desire to learn new technologies while supporting existing
Roles and Responsibilities
- Build end-to-end CI/CD pipelines using tools like Jenkins and Jenkins Pipelines.
- Build CI/CD pipelines to orchestrate provisioning and deployment of large-scale systems
- Develop and implement instrumentation for monitoring the health and availability of services including fault detection, alerting, triage, and recovery (automated and manual)
- Develop, improve, and thoroughly document operational practices and procedures.
- Perform tasks related to securing the products, tools, and processes you are responsible for, and to keeping our infrastructure secure.
- Agile software development practices
- Understand IT processes, including architecture, design, implementation, and operations
- Open Source development experience
- Self-motivated, able and willing to help where help is needed
- Able to build relationships, be culturally sensitive, have goal alignment, have learning agility
Location: Bangalore
About KNOLSKAPE
KNOLSKAPE is an end-to-end learning and assessment platform for accelerated employee development. Our core belief is that desired business outcomes are achieved best when learning needs are aligned with business requirements, but traditional methodologies for capability development require a new, more updated approach. Keeping with this philosophy, we offer engaging, immersive, and experiential learning and assessment solutions - strategy cascading, business acumen, change management, leadership pipeline, digital capabilities, and talent assessments. Leveraging a blended omnichannel delivery model, KNOLSKAPE offers instructor-led classroom sessions, live virtual sessions, and self-paced courses to suit every learning need.
More than 300 clients in 25 countries have benefited from KNOLSKAPE's award-winning experiential solutions. A 120+ strong team based out of offices in Singapore, India, Malaysia, and the USA serves a rapidly growing global client base across industries such as banking and finance, consulting, IT, FMCG, retail, manufacturing, infrastructure, pharmaceuticals, engineering, auto, government and academia.
KNOLSKAPE is a global Top 20 gamification company, recipient of numerous Brandon Hall awards, and has been recognized as a company to watch for in the Talent Management Space, by Frost & Sullivan, and as a disruptor in the learning space, by Bersin by Deloitte.
The DevOps Engineer's core responsibilities include automated configuration and management of infrastructure, continuous integration and delivery of distributed systems at scale in a Hybrid environment.
Must-Have:
● You have 4-10 years of experience in DevOps
● You have experience in managing IT infrastructure at scale
● You have experience in automation of deployment of distributed systems and in infrastructure provisioning at scale.
● You have in-depth hands-on experience on Linux and Linux-based systems, Linux scripting
● You have experience in Server hardware, Networking, firewalls
● You have experience in source code management, configuration management, continuous integration, continuous testing, continuous monitoring
● You have experience with CI/CD and related tools
● You have experience with Monitoring tools like ELK, Grafana, Prometheus
● You have experience with containerization, container orchestration, management
● Have a penchant for solving complex and interesting problems.
● Worked in startup-like environments with high levels of ownership and commitment.
● BTech, MTech or Ph.D. in Computer Science or related Technical Discipline
- Good experience in AWS services like Elastic Compute Cloud (EC2), IAM, RDS, API Gateway, Cognito, etc.
- Using GIT, SonarQube, Ansible, Nexus, Nagios, etc.
- Strong experience in creating, importing, and launching volumes with security groups, auto-scaling, load balancers, and fault-tolerant configurations
- Experience in configuring Jenkins job with related Plugins for Building, Testing, and Continuous Deployment to accomplish the complete CI/CD.
- As a DevOps Engineer, you need to have strong experience in CI/CD pipelines.
- Setup development, testing, automation tools, and IT infrastructure
- Defining and setting development, test, release, update, and support processes for DevOps operation
- Selecting and deploying appropriate CI/CD tools
- Deploy and maintain CI/CD pipelines across multiple environments (Mobile, Web APIs & AI/ML)
Required skills & experience:
- 3+ years of experience as DevOps Engineer and strong working knowledge in CI/CD pipelines
- Experience administering and deploying development CI/CD using Git, BitBucket, CodeCommit, Jira, Jenkins, Maven, Gradle, etc.
- Strong knowledge in Linux-based infrastructures and AWS/Azure/GCP environment
- Working knowledge on AWS (IAM, EC2, VPC, ELB, ALB, Autoscaling, Lambda, etc)
- Experience with Docker containerization and clustering (Kubernetes/ECS)
- Experience with Android source (AOSP) clone, build, and automation ecosystems
- Knowledge of scripting languages such as Python, Shell, Groovy, Bash, etc
- Familiar with Android ROM development and build process
- Knowledge of Agile Software Development methodologies












