Nikhita Kataria
When Nikhita Kataria talks about infrastructure observability, she speaks with skin in the game, to borrow Nassim Taleb's phrase. As a Site Reliability Engineer at LinkedIn, she's spent her career buried in logs, metrics, and system behavior, shaping observability systems not just to monitor core infrastructure but to improve how it functions.
Kataria has built a series of observability tools and processes that are now part of infrastructure initiatives at LinkedIn. One of her significant achievements is a system that auto-generates metrics and alerts for control plane services: the components that manage, rather than serve, user-facing data. These services power everything from deployments to configuration changes, making them critical to platform reliability.
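To make the idea concrete, here is a minimal sketch of what such auto-instrumentation can look like, assuming a Prometheus-style metrics client. The metric names, labels, service names, and alert threshold are illustrative, not LinkedIn's internal tooling.

```python
# Minimal sketch: wrap every control plane handler so metrics and an
# alert rule come "for free". All names here are hypothetical.
import functools
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "controlplane_requests_total",
    "Control plane requests by service, operation, and outcome",
    ["service", "operation", "status"],
)
LATENCY = Histogram(
    "controlplane_request_seconds",
    "Control plane request latency",
    ["service", "operation"],
)

def instrumented(service: str, operation: str):
    """Decorator that emits standard metrics for any control plane handler."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                REQUESTS.labels(service, operation, "success").inc()
                return result
            except Exception:
                REQUESTS.labels(service, operation, "error").inc()
                raise
            finally:
                LATENCY.labels(service, operation).observe(time.monotonic() - start)
        return inner
    return wrap

def error_rate_alert(service: str, threshold: float = 0.05) -> str:
    """Render a Prometheus alerting rule derived from the same metric."""
    name = service.replace("-", "_")  # alert names disallow hyphens
    err = f'controlplane_requests_total{{service="{service}",status="error"}}'
    all_ = f'controlplane_requests_total{{service="{service}"}}'
    return (
        f"- alert: {name}_high_error_rate\n"
        f"  expr: sum(rate({err}[5m])) / sum(rate({all_}[5m])) > {threshold}\n"
        f"  for: 10m\n"
    )

@instrumented("deployment-svc", "roll_forward")
def roll_forward(host: str) -> None:
    ...  # the actual deployment logic goes here
```

Because the decorator is applied uniformly, every new service gets the same metrics and a derived alert without anyone writing monitoring code by hand, which is the point of auto-generation.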
Another major area where her work has made an impact is in infrastructure debugging. Kataria developed a Grafana dashboard that traces events across different control plane services, helping engineers pinpoint what action was taken on which data center host and by whom. This cross-service traceability, she says, drastically cut down mean time to debug from days to just hours.
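The traceability she describes depends on services emitting events that share a common shape. Here is a minimal sketch, assuming each service writes JSON Lines with a shared set of fields; the field names, services, and hostname are hypothetical.

```python
# Sketch of a correlatable "who did what, where" event. The host field is
# the join key that lets one dashboard stitch services together.
import json
import sys
import time

def emit_event(service: str, host: str, actor: str, action: str, **detail) -> None:
    """Write one structured event as a JSON line."""
    event = {
        "ts": time.time(),
        "service": service,
        "host": host,    # the data center host acted upon
        "actor": actor,  # the person or automation that took the action
        "action": action,
        **detail,
    }
    sys.stdout.write(json.dumps(event) + "\n")

# Two different control plane services, one host, one reconstructable timeline.
emit_event("os-upgrade-svc", "dc1-host-042", "upgrade-bot", "reboot_scheduled")
emit_event("deployment-svc", "dc1-host-042", "alice", "service_restarted")
```

Once every service logs this shape, a single Grafana query filtered on host can reconstruct the timeline of what was done to that machine, and by whom.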
Her work isn't limited to diagnostics. She's also made observability proactive. At LinkedIn, she led the effort to track OS upgrades across all data centers. The dashboard she built provides real-time insight into upgrade progress and failure rates, something that was previously tracked manually or retroactively. This tool is now used daily and serves as a reliable view into infrastructure hygiene.
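The figures such a dashboard surfaces reduce to simple aggregation over a host inventory. A toy version follows, with an assumed data shape rather than LinkedIn's actual inventory format.

```python
# Toy calculation of per data center completion and failure rates
# from host upgrade records. The record shape is assumed.
from collections import Counter

hosts = [
    {"dc": "dc1", "status": "upgraded"},
    {"dc": "dc1", "status": "pending"},
    {"dc": "dc1", "status": "failed"},
    {"dc": "dc2", "status": "upgraded"},
    {"dc": "dc2", "status": "upgraded"},
]

by_dc: dict[str, Counter] = {}
for h in hosts:
    by_dc.setdefault(h["dc"], Counter())[h["status"]] += 1

for dc, counts in sorted(by_dc.items()):
    total = sum(counts.values())
    done = counts["upgraded"] / total * 100
    failed = counts["failed"] / total * 100
    print(f"{dc}: {done:.0f}% upgraded, {failed:.0f}% failed ({total} hosts)")
```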
On the operations front, Kataria worked on alert tuning for LinkedIn's deployment stack, improving the signal-to-noise ratio and helping reduce alert fatigue among engineers. She also created operational review dashboards that bring together key business and system metrics.
The impact is measurable. Her OS upgrade tracking dashboard is in daily use, showing percentage completion across hosts and helping teams prioritize fix efforts. The Grafana dashboard that traces all events for a data center host across different control plane services is used by key offline systems powering LinkedIn's website. And by defining SLAs, SLOs, and SLIs for critical control plane services, she's helped teams focus on what matters to users and given software engineers clear goals for what to improve.
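The SLO side of that work rests on standard error budget arithmetic. A small illustration with made-up numbers, assuming a 99.9% availability SLO:

```python
# Given request counts for a window, compute the availability SLI and
# the fraction of the error budget still unspent against the SLO.
SLO = 0.999

def error_budget_remaining(good: int, total: int, slo: float = SLO) -> float:
    """Fraction of the error budget still unspent for this window."""
    sli = good / total            # the measured SLI
    budget = 1.0 - slo            # allowed failure fraction under the SLO
    spent = (1.0 - sli) / budget  # share of the budget already consumed
    return max(0.0, 1.0 - spent)

print(error_budget_remaining(good=999_620, total=1_000_000))  # -> ~0.62
```

An SLI that reads 99.962% against a 99.9% target leaves roughly 62% of the budget, which is exactly the kind of number that tells a team whether to ship features or pay down reliability debt.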
These projects came with their own challenges. One required unifying log schemas across more than 10 teams. Each team emitted events in a different format, making it nearly impossible to run meaningful queries or correlate activity. Kataria not only initiated the schema standardization effort but also managed stakeholder alignment, ensuring the unified structure was implemented across the board.
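In code terms, unification of this kind usually means one canonical event shape plus per-team adapters that translate legacy formats into it. A hypothetical sketch, reusing the fields from the event example above; both legacy shapes are invented:

```python
# One canonical event shape; each team's legacy format maps onto it.
from dataclasses import dataclass

@dataclass
class CanonicalEvent:
    ts: float
    service: str
    host: str
    actor: str
    action: str

def from_team_a(raw: dict) -> CanonicalEvent:
    # Team A logged flat keys: {"time", "svc", "machine", "user", "op"}.
    return CanonicalEvent(raw["time"], raw["svc"], raw["machine"], raw["user"], raw["op"])

def from_team_b(raw: dict) -> CanonicalEvent:
    # Team B nested the host and actor, and used epoch milliseconds.
    return CanonicalEvent(
        raw["epoch_ms"] / 1000.0,
        raw["service"],
        raw["target"]["host"],
        raw["initiator"]["id"],
        raw["event_type"],
    )
```

Once every team routes its events through an adapter like these, queries and correlations only ever have to know about the canonical shape.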
In another initiative, designing disaster recovery metrics, she faced the problem of knowing what to track in the first place. After researching the key indicators needed to monitor both compute and storage health, she built a framework for service owners to ensure readiness in critical failover scenarios. A paper based on this work is currently pending publication.
From her standpoint, observability is not a post-launch feature; it's a design principle. "If a software engineer says we can think about metrics later, they are signing up for operational toil later," she says. According to Kataria, thinking about observability from day zero can prevent months of reactive troubleshooting down the line.
Her perspective also extends to what's next. Artificial intelligence, she says, is going to reshape monitoring, but not without limitations. "People are building AI-powered features, but AI still requires a tight feedback loop to learn the right observability patterns," she explains. Even similar tech stacks across companies like Facebook and LinkedIn generate different monitoring needs, depending on how the software is used in practice. Moreover, AI would need a deep understanding of operating systems to set meaningful thresholds or surface relevant anomalies.
Looking forward, she sees observability becoming more embedded, not as a standalone concern, but as a core pillar for a healthy infrastructure architecture. With her work already influencing how mission-critical services at LinkedIn are built and monitored, she's clearly helping that infrastructure take shape-one metric at a time.