Nikhita Kataria
When Nikhita Kataria talks about infrastructure observability, she speaks with skin in the game, to borrow Nassim Taleb's phrase. As a Site Reliability Engineer at LinkedIn, she's spent her career buried in logs, metrics, and system behavior, shaping observability systems not just to monitor core infrastructure but to improve how it functions.
Kataria has built a series of observability tools and processes that are now part of infrastructure initiatives at LinkedIn. One of her significant achievements is a system that auto-generates metrics and alerts for control plane services: the components that manage, rather than serve, user-facing data. These services power everything from deployments to configuration changes, making them critical to platform reliability.
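To make the idea concrete, here is a minimal sketch of what such auto-instrumentation can look like, assuming a Prometheus-style metrics client. The metric names, labels, service names, and alert threshold are illustrative, not LinkedIn's internal tooling.

```python
# Minimal sketch: wrap every control plane handler so metrics and an
# alert rule come "for free". All names here are hypothetical.
import functools
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "controlplane_requests_total",
    "Control plane requests by service, operation, and outcome",
    ["service", "operation", "status"],
)
LATENCY = Histogram(
    "controlplane_request_seconds",
    "Control plane request latency",
    ["service", "operation"],
)

def instrumented(service: str, operation: str):
    """Decorator that emits standard metrics for any control plane handler."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                REQUESTS.labels(service, operation, "success").inc()
                return result
            except Exception:
                REQUESTS.labels(service, operation, "error").inc()
                raise
            finally:
                LATENCY.labels(service, operation).observe(time.monotonic() - start)
        return inner
    return wrap

def error_rate_alert(service: str, threshold: float = 0.05) -> str:
    """Render a Prometheus alerting rule derived from the same metric."""
    name = service.replace("-", "_")  # alert names disallow hyphens
    err = f'controlplane_requests_total{{service="{service}",status="error"}}'
    all_ = f'controlplane_requests_total{{service="{service}"}}'
    return (
        f"- alert: {name}_high_error_rate\n"
        f"  expr: sum(rate({err}[5m])) / sum(rate({all_}[5m])) > {threshold}\n"
        f"  for: 10m\n"
    )

@instrumented("deployment-svc", "roll_forward")
def roll_forward(host: str) -> None:
    ...  # the actual deployment logic goes here
```

Because the decorator is applied uniformly, every new service gets the same metrics and a derived alert without anyone writing monitoring code by hand, which is the point of auto-generation.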
Another major area where her work has made an impact is in infrastructure debugging. Kataria developed a Grafana dashboard that traces events across different control plane services, helping engineers pinpoint what action was taken on which data center host and by whom. This cross-service traceability, she says, drastically cut down mean time to debug from days to just hours.
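The traceability she describes depends on services emitting events that share a common shape. Here is a minimal sketch, assuming each service writes JSON Lines with a shared set of fields; the field names, services, and hostname are hypothetical.

```python
# Sketch of a correlatable "who did what, where" event. The host field is
# the join key that lets one dashboard stitch services together.
import json
import sys
import time

def emit_event(service: str, host: str, actor: str, action: str, **detail) -> None:
    """Write one structured event as a JSON line."""
    event = {
        "ts": time.time(),
        "service": service,
        "host": host,    # the data center host acted upon
        "actor": actor,  # the person or automation that took the action
        "action": action,
        **detail,
    }
    sys.stdout.write(json.dumps(event) + "\n")

# Two different control plane services, one host, one reconstructable timeline.
emit_event("os-upgrade-svc", "dc1-host-042", "upgrade-bot", "reboot_scheduled")
emit_event("deployment-svc", "dc1-host-042", "alice", "service_restarted")
```

Once every service logs this shape, a single Grafana query filtered on host can reconstruct the timeline of what was done to that machine, and by whom.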
Her work isn't limited to diagnostics. She's also made observability proactive. At LinkedIn, she led the effort to track OS upgrades across all data centers. The dashboard she built provides real-time insight into upgrade progress and failure rates, something that was previously tracked manually or retroactively. This tool is now used daily and serves as a reliable view into infrastructure hygiene.
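The figures such a dashboard surfaces reduce to simple aggregation over a host inventory. A toy version follows, with an assumed data shape rather than LinkedIn's actual inventory format.

```python
# Toy calculation of per data center completion and failure rates
# from host upgrade records. The record shape is assumed.
from collections import Counter

hosts = [
    {"dc": "dc1", "status": "upgraded"},
    {"dc": "dc1", "status": "pending"},
    {"dc": "dc1", "status": "failed"},
    {"dc": "dc2", "status": "upgraded"},
    {"dc": "dc2", "status": "upgraded"},
]

by_dc: dict[str, Counter] = {}
for h in hosts:
    by_dc.setdefault(h["dc"], Counter())[h["status"]] += 1

for dc, counts in sorted(by_dc.items()):
    total = sum(counts.values())
    done = counts["upgraded"] / total * 100
    failed = counts["failed"] / total * 100
    print(f"{dc}: {done:.0f}% upgraded, {failed:.0f}% failed ({total} hosts)")
```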
On the operations front, Kataria worked on alert tuning for LinkedIn's deployment stack, improving the signal-to-noise ratio and helping reduce alert fatigue among engineers. She also created operational review dashboards that bring together key business and system metrics.
The impact is measurable. Her OS upgrade tracking dashboard is in daily use, showing percentage completion across hosts and helping teams prioritize fix efforts. The Grafana dashboard that traces all events for a data center host across different control plane services is used by key offline systems powering LinkedIn's website. And by defining SLAs, SLOs, and SLIs for critical control plane services, she's helped teams focus on what matters to users and given software engineers clear goals for what to improve.
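The SLO side of that work rests on standard error budget arithmetic. A small illustration with made-up numbers, assuming a 99.9% availability SLO:

```python
# Given request counts for a window, compute the availability SLI and
# the fraction of the error budget still unspent against the SLO.
SLO = 0.999

def error_budget_remaining(good: int, total: int, slo: float = SLO) -> float:
    """Fraction of the error budget still unspent for this window."""
    sli = good / total            # the measured SLI
    budget = 1.0 - slo            # allowed failure fraction under the SLO
    spent = (1.0 - sli) / budget  # share of the budget already consumed
    return max(0.0, 1.0 - spent)

print(error_budget_remaining(good=999_620, total=1_000_000))  # -> ~0.62
```

An SLI that reads 99.962% against a 99.9% target leaves roughly 62% of the budget, which is exactly the kind of number that tells a team whether to ship features or pay down reliability debt.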
These projects came with their own challenges. One required unifying log schemas across more than 10 teams. Each team emitted events in a different format, making it nearly impossible to run meaningful queries or correlate activity. Kataria not only initiated the schema standardization effort but also managed stakeholder alignment, ensuring the unified structure was implemented across the board.
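In code terms, unification of this kind usually means one canonical event shape plus per-team adapters that translate legacy formats into it. A hypothetical sketch, reusing the fields from the event example above; both legacy shapes are invented:

```python
# One canonical event shape; each team's legacy format maps onto it.
from dataclasses import dataclass

@dataclass
class CanonicalEvent:
    ts: float
    service: str
    host: str
    actor: str
    action: str

def from_team_a(raw: dict) -> CanonicalEvent:
    # Team A logged flat keys: {"time", "svc", "machine", "user", "op"}.
    return CanonicalEvent(raw["time"], raw["svc"], raw["machine"], raw["user"], raw["op"])

def from_team_b(raw: dict) -> CanonicalEvent:
    # Team B nested the host and actor, and used epoch milliseconds.
    return CanonicalEvent(
        raw["epoch_ms"] / 1000.0,
        raw["service"],
        raw["target"]["host"],
        raw["initiator"]["id"],
        raw["event_type"],
    )
```

Once every team routes its events through an adapter like these, queries and correlations only ever have to know about the canonical shape.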
In another initiative, designing disaster recovery metrics, she faced the problem of knowing what to track in the first place. After researching the key indicators needed to monitor both compute and storage health, she built a framework for service owners to ensure readiness in critical failover scenarios. A paper based on this work is currently pending publication.
From her standpoint, observability is not a post-launch feature; it's a design principle. "If a software engineer says we can think about metrics later, they are signing up for operational toil later," she says. According to Kataria, thinking about observability from day zero can prevent months of reactive troubleshooting down the line.
Her perspective also extends to what's next. Artificial intelligence, she says, is going to reshape monitoring, but not without limitations. "People are building AI-powered features, but AI still requires a tight feedback loop to learn the right observability patterns," she explains. Even similar tech stacks across companies like Facebook and LinkedIn generate different monitoring needs, depending on how the software is used in practice. Moreover, AI would need a deep understanding of operating systems to set meaningful thresholds or surface relevant anomalies.
Looking forward, she sees observability becoming more embedded, not as a standalone concern, but as a core pillar for a healthy infrastructure architecture. With her work already influencing how mission-critical services at LinkedIn are built and monitored, she's clearly helping that infrastructure take shape-one metric at a time.