NetApp, Inc.
Manager, Cloud SRE & Incident Management
2021 – Present
Strategic Thinking
Management of Incident Management and Operational Supportability
I led the Incident Management and Operational Supportability processes for Google Cloud NetApp Volumes, with a focus on streamlining and improving how we respond to incidents. I led the documentation of policies covering case escalation, postmortems, and end-to-end case flow, ensuring a systematic approach to incident management.
In addition, I played a pivotal role in the creation and execution of a Support Implementation Plan (SIP). This strategic document was a roadmap for our operational support, outlining the necessary steps to effectively handle incidents.
Moreover, I authored 345 pieces of documentation covering roll-forward, rollback, and feature flags. These resources equip our team to address incidents effectively and maintain operational stability, supporting our goal of delivering reliable and efficient services to our users.
Development and Implementation of the NetApp CRE Program
I spearheaded the strategic inception and development of the inaugural NetApp CRE (Customer Reliability Engineering) program. This initiative required a strategic vision to identify and fill a critical gap in our organization’s structure and operations.
In my role, I was responsible for the strategic recruitment and integration of a team of CREs and Incident Managers. This strategic alignment of resources allowed our senior SREs (Site Reliability Engineers) to focus on enhancing the reliability of our services, a key strategic objective.
Simultaneously, the CRE program established an internal growth trajectory for our junior engineers. This strategic move was designed to foster talent development and retention, promoting from within and thereby ensuring continuity of expertise and knowledge within our team.
Throughout the implementation of this program, I maintained a strong emphasis on customer service, aligning our internal operations with our strategic goal of delivering exceptional service to our customers. This strategic thinking and planning has been instrumental in the success of the NetApp CRE program.
Establishment of First-Party Partnership with Google
I participated in establishing a first-party partnership between two industry leaders, Google and NetApp. This partnership represented a significant strategic milestone in our company’s growth and market positioning.
The transition to a first-party partnership required careful planning and execution, and my role was pivotal in making it seamless, solidifying our relationship with Google and enhancing our service offerings to our customers. This accomplishment has had a profound impact on our business operations and strategic direction.
Planning and Execution of the India Google Cloud NetApp Volumes Summit
I took the lead in strategizing and executing the India Google Cloud NetApp Volumes Summit at NetApp. This required strategic thinking to identify relevant topics that would resonate with the audience and contribute to our company’s objectives.
My role involved creating a comprehensive agenda that effectively balanced various themes and considerations. This strategic planning ensured that the summit would be informative, engaging, and valuable for all attendees.
Furthermore, I led interactive classroom sessions, a strategic choice designed to foster engagement, promote discussion, and facilitate a deeper understanding of the topics at hand. This initiative not only showcased my leadership and organizational skills but also highlighted my ability to strategically manage diverse tasks to create a cohesive and successful event.
The strategic planning and execution of the India Google Cloud NetApp Volumes Summit were instrumental in promoting our product, engaging with our audience, and reinforcing our position in the market.
Participation in the 2024 Google/NetApp Summit
At the 2024 Google/NetApp Summit, I took an active role, fostering constructive dialogue aimed at enhancing our strategic partnership, a key factor in our company’s long-term growth and success.
Additionally, I delivered a strategic presentation reviewing the service health of Incident Management & Support since the General Availability (GA) of Google Cloud NetApp Volumes. This presentation was not just a review, but a strategic analysis aimed at identifying areas of strength, opportunities for improvement, and strategies for enhancing service delivery.
Through these actions, I was able to contribute strategically to the summit, providing valuable insights and fostering stronger relations between Google and NetApp, thereby driving our shared strategic objectives forward.
Implementation of Google Cloud NetApp Volumes Support Documentation
I led the strategic implementation of innovative support and incident management processes, procedures, and playbooks for Google Cloud NetApp Volumes. This initiative was a strategic move designed to streamline our operations and improve our service delivery as we transitioned to first-party status.
This strategic implementation required foresight and planning to ensure that the new documentation would effectively guide our team, enhance our service delivery, and align with our new role as a first-party service provider.
The successful implementation of these pioneering processes and procedures not only marked Google Cloud NetApp Volumes’ transition to first-party status but also established a robust framework for our ongoing operations, demonstrating the strategic thinking behind our support and incident management.
Facilitation of NetApp TVC Access
In my role as the primary liaison between NetApp and Google for NetApp TVC Access, I strategically managed the process, ensuring efficient and accurate handling of access requests. This involved meticulous documentation, efficient triage, and accurate tracking – all critical elements of a well-managed process.
The TVC onboarding process posed significant complexities, but my strategic approach allowed me to navigate these challenges effectively. I continuously worked towards smooth operations and effective adaptation to the new system, ensuring that the process was streamlined and efficient.
This strategic facilitation was instrumental in ensuring that the TVC Access process was not a bottleneck but a facilitator of smooth operations between NetApp and Google. By strategically addressing this challenge, I was able to enhance our operational efficiency and strengthen our partnership with Google.
Implementation of Primary/Secondary On-Call Rotation
I strategically designed and implemented the inaugural Primary + Secondary on-call rotation at NetApp for CVS for GCP SRE on-callers. This initiative was a strategic move to separate duties for primary and secondary roles, promoting focused and balanced work efforts during shifts.
The scheduling of the rotation was done with a strategic view to balance the workload and ensure that there was always a dedicated team available to handle any incidents or issues that might arise.
Additionally, I enhanced the warm hand-off processes to ensure a seamless transition of critical information between shifts. This strategic improvement was aimed at minimizing any potential gaps in service or communication, thereby ensuring continuity of service and maintaining high standards of customer service.
This strategic approach to on-call rotation has always kept customer needs at the forefront, ensuring we provide the best possible service at all times. The strategic planning and execution of this initiative have contributed significantly to our operational efficiency and customer satisfaction.
Leadership in Quarterly Roadmap Discussions
I played a central role in the strategic development and leadership of our Quarterly Roadmap discussions. These discussions were a strategic tool designed to set our direction and align team efforts towards shared goals, ensuring that our team was moving in a unified direction and working towards common objectives.
Although these discussions have since been discontinued, their impact while they ran was significant. They served as a platform for team collaboration and planning, fostering a shared understanding of our goals and strategies.
My role in these discussions demonstrated strategic leadership, as I helped guide the team towards our strategic objectives, ensured alignment of efforts, and facilitated effective collaboration. This strategic approach to our Quarterly Roadmap discussions contributed significantly to our team’s cohesion and productivity.
Communication
Holiday Break On-Call Guidance Communication
I efficiently coordinated and disseminated the 2023 Holiday Break On-Call guidance. My goal was to facilitate a smooth holiday shutdown for all stakeholders, with a keen focus on addressing the needs of on-call personnel.
This communication initiative was comprehensive, encompassing the provision of links to all escalation processes, reminders about internal procedures, and the sharing of vital knowledge. I executed this task using a clear and concise communication strategy, ensuring that the information was digestible and accessible to all.
My effective communication of these guidelines ensured that all team members were well-informed and adequately prepared, contributing to a seamless and uninterrupted holiday period. This accomplishment underscores my leadership in communication, demonstrating my ability to effectively convey crucial information and maintain team engagement, even during holiday periods.
Crafting and Communicating the NetApp SRE Support & Communication Policy
Demonstrating my leadership in communication, I collaborated with leaders across the organization to craft, document, and secure senior leadership endorsement for a new NetApp SRE Support and Communication policy.
My role involved not only the development of the policy but also effectively communicating it across various levels of the organization. I ensured the policy was articulated in a manner that was clear and understandable, fostering a comprehensive understanding of the new procedures and expectations among Google Support, NetApp SRE, and NetApp Development Engineering teams.
This policy has streamlined our SRE troubleshooting and escalation process, a testament to the effectiveness of the communication strategies employed. This accomplishment underscores my communication leadership, highlighting my ability to effectively convey information and drive understanding across diverse teams within the organization.
People Development
Career Aspiration & Job Satisfaction Tracking
I took the initiative to single-handedly create a Career Aspirations page for Managers. This tool, designed with the intent to better understand and support our team members, tracks the current workload, job satisfaction, career aspirations, and geographical preferences of their direct reports.
My efforts in this initiative have provided managers with valuable insights into their team members’ workloads, career ambitions, and levels of job satisfaction. This understanding is a vital factor in fostering a supportive and productive work environment.
With this tool, we have been able to more effectively allocate resources, offer tailored development opportunities, and ensure that job roles are not only challenging but also fulfilling. This accomplishment underscores my commitment to people development, showcasing my proactive approach to understanding and addressing the needs of our team members to promote their professional growth and job satisfaction.
Development of Training Programs for Google Support Personnel
I collaborated closely with the Product Management and SRE teams to develop training programs for Google Support personnel. These programs were designed with the dual objectives of disseminating essential knowledge and facilitating seamless adoption of Google Cloud NetApp Volumes.
In addition to helping design these training programs, I took the helm in conducting highly successful TSE training sessions focused on this cutting-edge technology. My efforts ensured that Google Support personnel were adequately equipped with the knowledge and skills needed to effectively support this technology.
Active Directory Playbook Documentation
I facilitated the creation of comprehensive Active Directory documentation. This initiative was designed to enhance the knowledge and skills of our team, enabling them to handle Active Directory issues more effectively.
This initiative is expected to substantially reduce the number of Active Directory tickets escalated to NetApp, thereby decreasing the resource demands on NetApp’s Site Reliability Engineering team.
Team Building
Promoting Work-Life Balance
Exemplifying my leadership in team building, I’ve been steadfastly committed to fostering a healthy work-life balance within our team. Recognizing the importance of this balance to overall job satisfaction and productivity, I’ve implemented measures to accommodate the needs of our team members across different geographical locations. One such measure was adjusting the timing of our Daily Live Site meetings before and after the transition into and out of Daylight Saving Time. This adjustment was made with the goal of accommodating as many different geographical time zones as possible, thereby ensuring that we provide the best work-life balance we can for all team members.
Furthermore, I implemented a policy of utilizing only the Primary On-Call during holiday breaks, rather than having both a primary and secondary on-call. This approach was based on the observation that holiday periods typically see fewer internal issues due to change freezes, as well as a reduction in customer escalations. By doing so, we were able to give back as much time as possible to our team members, further supporting their work-life balance. These strategic adjustments were designed to respect and value our team members’ time and personal lives, fostering a supportive work environment and strengthening team cohesion. My leadership in this area underscores my commitment to building a team that is not only productive but also balanced and satisfied.
Peer Recognition Survey
To foster a positive and appreciative team culture, I spearheaded the establishment and communication of the Monthly Peer Recognition Survey. This initiative was designed as a platform for team members to acknowledge their colleagues’ hard work and contributions, building a culture of recognition and appreciation within our Site Reliability Engineering (SRE) team.
Every month, I manage the creation of the survey, compile the results, and involve my management peers in the decision-making process. The culmination of this process is the announcement of the deserving winner each month, who receives a dinner voucher as a token of appreciation.
This initiative has not only strengthened the bonds within our team but also created a more connected, appreciative, and positive team environment. By acknowledging and rewarding the hard work of our team members, I’ve helped to foster an atmosphere of mutual respect and recognition, which is crucial for team building and morale.
Team Building Events
Recognizing the importance of interpersonal relationships and camaraderie in a team, I led the organization of a series of team-building outings. These events were designed to bolster work relationships and foster a sense of camaraderie within our team. They served as a platform for team members to interact outside of the professional sphere, enhancing team cohesion, and improving communication.
These outings not only brought joy and a sense of belonging to our team members but also served as a token of appreciation for their hard work and dedication. The noticeable improvement in team dynamics, coupled with the positive feedback received from team members, underscored the success of this initiative.
Through these events, I aimed to create a team environment that values not just professional achievements, but also interpersonal relationships and mutual respect. The success of these team-building events ultimately contributed to increased productivity, improved work satisfaction, and a stronger, more connected team.
Results Oriented
SO, PO, and Backup Support Documentation
Demonstrating my results-oriented leadership, I marshalled resources and gathered subject matter experts to develop in-depth initial SO, PO, and Backup runbook documentation and presentations. This strategic initiative was focused on achieving tangible results, leading to a significant decrease in resource requirements for NetApp’s Site Reliability Engineering team. My results-oriented approach ensured that our efforts were directed towards creating efficient processes, contributing to the overall productivity and effectiveness of our team.
Create Volume Support Documentation
Exemplifying my results-oriented leadership, I led the initiative to develop comprehensive ‘Create Volume’ triage documentation. This initiative was focused on achieving a specific outcome – a substantial decrease in the resource demands on NetApp’s Site Reliability Engineering team. My leadership in this initiative underscores my commitment to driving tangible results that enhance our team’s efficiency and effectiveness.
MTTx Reduction
I spearheaded initiatives that decreased the weekly Mean Time To Resolution (MTTR) of operational tickets by 86% over time. This reduction was achieved by enforcing ticket hygiene among all our CREs, SREs, and Incident Managers, and by ensuring that documentation for any alert that enters the queue is readily available and accessible. I also led efforts that reduced the volume of weekly alerts by 67% over time, accomplished by conducting thorough analyses to identify and minimize noisy, ineffective, or unactionable alerts. These initiatives underline my results-oriented approach, focusing on achieving measurable improvements in our operations.
On-Call Survey
Showcasing my results-oriented leadership, I established the first-ever On-Call Survey at NetApp. This survey was designed to collect feedback from recent on-call participants, with the goal of identifying areas for process improvement and enhancing the overall on-call experience. This results-oriented initiative has empowered our team and boosted morale across the organization, demonstrating the tangible benefits of giving on-callers a voice.
Executive State-of-the-State Dashboard
Exhibiting my results-oriented leadership, I developed the CVS SRE State-of-the-State Executive Dashboard. This comprehensive tool provides high-level data summaries for swift updates and trend analyses, streamlining understanding of the status of various CVS SRE queues. This initiative was focused on achieving a specific result – enhancing the efficiency and effectiveness of our data analysis and reporting processes. This results-oriented approach has contributed to improved decision-making and operational efficiency in our team.
Influence and Persuasion
Strategic Alignment of Engineering Processes with SLOs
I successfully spearheaded an initiative to align Engineering with Service Level Objectives (SLOs) at Google. This accomplishment was showcased through a well-organized and thoroughly documented presentation at the 2024 NetApp Quality Summit in San Jose, CA.
In the presentation, I provided a comprehensive analysis of current P0 escalation metrics, highlighted key successes and areas for improvement over the year, and set forth clear expectations for P0 escalations moving forward. I also emphasized the importance of collaboration in maintaining a successful product at Google.
My ability to persuasively communicate these insights and expectations played a crucial role in aligning our engineering processes with our SLOs. This accomplishment underscores my leadership in using influence and persuasion to drive alignment and foster collaboration.
Outstanding Ticket Tracking by Manager
I developed dedicated tracking pages for Postmortem Action Items, COPS incidents, Runbook Tickets, and Alert Tickets, all organized by SRE managers. This initiative was designed to simplify the tracking of outstanding tickets, providing managers with a convenient overview of their team’s tasks.
By persuasively communicating the benefits of this system, I was able to gain buy-in from managers and their respective direct reports. The pages swiftly highlight teams with numerous outstanding items, enabling us to prompt those teams to expedite closures. As a result, we’ve seen significant improvements in our operational hygiene and the overall health of our Incident Management process.
This accomplishment underscores my leadership in using influence and persuasion to drive process improvements and enhance operational efficiency.
Change Management
Google Buganizer Implementation
I spearheaded the transition from the JIRA GCPSD project to Google’s internal Buganizer tool across our organization. This significant change required careful planning, communication, and execution to minimize disruption and ensure a smooth transition. My leadership in this initiative enhanced our operational efficiency and improved project management within the organization, showcasing the successful management of change to drive organizational improvements.
Change Management Playbook
I developed and documented the first-ever Change Management Playbook for NetApp CVS SRE. This playbook provides a comprehensive guide for managing changes, minimizing disruptions, and maximizing the efficiency of implementing new systems or processes. The development and implementation of this playbook represented a significant change in our internal processes, requiring effective change management to ensure its successful adoption.
Runbook Organization
I cataloged and labeled all our runbooks and How-To documentation and developed an innovative SRE Runbook Portal. This portal, equipped with advanced macros and a search bar, features high-level categories for Runbooks/How-To’s, enabling rapid discovery and access to relevant documents. The implementation of this portal represented a significant change in how our team accessed and utilized documentation, requiring careful change management to ensure its successful adoption and utilization.
Incident Management Playbook
I developed and implemented the first-ever Incident Management Playbook for NetApp CVS SRE. This playbook aligns internal procedures with industry standards and provides a comprehensive guide for incident management, including ticket handling, triage, escalations, and postmortems. The introduction of this playbook represented a significant change in our incident management processes, requiring effective change management to ensure its successful adoption.
StatusPage Templates
I spearheaded a project to develop standardized messaging templates for StatusPage maintenance communications. This change was aimed at providing clear, concise, and relevant information to customers, thereby minimizing the need for follow-up queries and potential push-back. The implementation of these templates represented a significant change in our customer communication processes, requiring careful change management to ensure their successful adoption and utilization.
Postmortem Procedures & Documentation
I led the development of NetApp-specific Internal Postmortem Procedures & Documentation, bringing them in line with industry best practices. This initiative filled a gap in our SRE organization’s structure, process, and policy, providing a robust framework for handling postmortem reviews. The introduction of these procedures and documentation represented a significant change in our postmortem processes, requiring effective change management to ensure their successful adoption and utilization.
Business Acumen
Driving OPEX Savings through Cloud-Native Transition
I spearheaded a transition from SolarWinds Pingdom to cloud-native tools, including Google Probers and Google Cloud NetApp Volumes, enhancing infrastructure monitoring efficiency. A comprehensive analysis with the team determined that SolarWinds Pingdom was no longer required, which allowed us to optimize resource allocation. The shift led to annual operational expenditure (OPEX) savings of approximately $10,000. This accomplishment showcases strong leadership, team collaboration skills, and a strategic approach to improving operational efficiency.
Proficiency in Google Tools
I am proficient in a wide range of Google Tools, including Buganizer and IRM. This proficiency enabled me to optimize workflows and drive enhanced productivity, reflecting an understanding of how to leverage technology to drive business results. My effective use of these tools not only improved our team’s efficiency but also contributed to better business outcomes.
Customer-facing RCA Template
I developed a new customer-facing RCA (Root Cause Analysis) template that streamlines root cause documentation. Designed with a keen understanding of customer needs and expectations, the template ensures all relevant information is included in the document provided to customers. It has helped reduce post-delivery queries and confusion, thereby improving the customer service provided by CVS SRE. This accomplishment demonstrates my ability to apply business acumen in creating solutions that enhance customer satisfaction and business performance.
PagerDuty Renewal
I led negotiations with PagerDuty, in collaboration with the procurement team. The negotiation resulted in a 4% reduction in the annual licensing renewal cost, saving the company $4,290. This accomplishment underscores my negotiation skills and commitment to cost-efficiency, demonstrating my ability to apply business acumen to drive financial results for the company.
My Website – AI Ops SRE
Beyond my full-time professional role, I harbor a passion for exploring the dynamic intersection of Artificial Intelligence, Site Reliability Engineering (SRE), and Operations. This passion has manifested in the creation of my website, www.AIOpsSRE.com, a treasure trove of insightful articles and engaging content that delve deep into these fascinating domains.
This platform serves as a space where I share my knowledge, insights, and thought leadership on the transformative power of AI in SRE and Operations. Whether you’re an industry veteran, a curious newcomer, or somewhere in between, you’ll find a wealth of information tailored to your interests and level of expertise.
Here are just a couple of previews of the articles you’ll find on my site. Each one is a labor of love, written and curated in my spare time to feed your curiosity and fuel your passion for AI, SRE, and Operations. So why wait? Dive in and discover a world of knowledge waiting at your fingertips!