LIVIN' ON THE EDGE PODCAST

S3 Ep5: DevOps & AI Unveiled: Exposing 'You Build It, You Run It' Myth

About

In a revealing episode of the "Livin' On the Edge" podcast, host Jake Beck engages with Natan Yellin, CEO of Robusta, to dive into the complexities and evolution of DevOps, particularly when intertwined with AI. This discussion critically examines the traditional DevOps mantra "You Build It, You Run It," challenging its relevance and application in the modern software development landscape.

Episode Guests

Natan Yellin
CEO of Robusta
Natan graduated from the Israel Institute of Technology with a dual major in Computer Science and Physics. Since then, he has been a software developer and researcher for the Government of Israel, then an engineer at Alcide.io, and is now the CEO of Robusta, which helps developers respond faster by automating Kubernetes monitoring. Robusta tracks errors like crashing pods or Prometheus alerts, enriches them, and enables one-click remediations. It turns siloed knowledge into automated runbooks defined in YAML.
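The "enrich and remediate" runbook idea in the bio can be pictured with a short sketch. This is not Robusta's actual API; the event shape, field names, and the `fetch_logs` callback are all hypothetical, just to illustrate the enrich-then-suggest flow.

```python
# Hypothetical sketch of an "enrich a crashing-pod event" runbook.
# Not Robusta's real API: event fields and callbacks are made up.

def enrich_crash_event(event, fetch_logs):
    """Turn a raw 'pod crashed' event into an actionable report."""
    logs = fetch_logs(event["namespace"], event["pod"])
    action = "raise memory limit" if event["reason"] == "OOMKilled" else "inspect logs"
    return {
        "title": f"Pod {event['pod']} crashed ({event['reason']})",
        "last_logs": logs[-5:],          # tail, where the failure usually shows
        "suggested_action": action,
    }

# Fake log fetcher standing in for a real Kubernetes API call:
fake_logs = lambda ns, pod: ["starting...", "allocating cache", "OOM: killed"]
report = enrich_crash_event(
    {"namespace": "prod", "pod": "checkout-5d9f", "reason": "OOMKilled"},
    fake_logs,
)
```

The point is the shape of the output: the raw event, the relevant log tail, and a suggested next step arrive together instead of in three different tools.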

Key Highlights of This Episode:

  1. Debunking the DevOps Mantra: Natan Yellin articulates the impracticality of expecting developers to both build and operate their software, likening it to expecting a mechanical engineer who makes screws for airplanes to also fly the plane. This analogy underscores the distinct skill sets required for development and operations, highlighting the potential chaos in conflating these roles.
  2. Empowering Developers: Yellin advocates for a shift in the DevOps culture, emphasizing the importance of developers understanding the entire software lifecycle. This understanding motivates them to write cleaner code, implement effective error handling, and proactively address potential issues, balancing the focus between new features and stability.
  3. Role of AI in DevOps: AI emerges as a vital tool in supporting the 'build and run' approach, offering developers insights into their code's performance, identifying bottlenecks, and addressing issues proactively. Yellin stresses that AI, far from replacing human programmers, complements them by filling gaps in lifecycle management.
  4. Human and AI Collaboration: The conversation acknowledges the human-like attributes of AI in creativity and performing imprecise tasks, while noting its limitations in logical reasoning and common sense. This balance between AI-driven automation and human intervention remains a key challenge and opportunity in evolving DevOps practices.
  5. Redefining the Approach: The discussion concludes with a nuanced understanding of "You Build It, You Run It," suggesting a collaborative model where AI aids developers in managing and troubleshooting software, enhancing innovation and efficiency rather than replacing human input.

Transcript

00:00.00
ambassadorlabs
Oh, wait a second, Jake, you're muted there. There you go? Okay, now you're good. No, no, no, no, we heard you, you're good now.

00:06.25
Jake Beck
Am I good now? Was I muted the whole time?

00:08.76
Natan
Um, I was there. I mean, you were talking to me, and I was hearing you all.

00:16.40
Jake Beck
Okay, hello, everyone, and welcome back to another Livin' on the Edge episode. I'm here with Natan Yellin, the CEO of Robusta. That's about all I'm going to go into for the background. Natan, do you want to take it away and give us a little bit of an introduction?

00:32.60
Natan
Yeah, sure. So, um, I come from a software engineering background. I was a software developer who mostly did cybersecurity for the past ten years, and then eventually moved over from the more classical network security stuff to cloud security, and I got to know Kubernetes firsthand. Today at Robusta, I'm one of the co-founders, and what we do is we have a platform that helps busy DevOps, SRE, and platform teams offload the work that they do to support developers troubleshooting issues in production, looking at their applications, from the silly stuff like "where are my logs" to the complicated things like a peculiar Kubernetes issue and how do I troubleshoot that. So we help those teams enable developers to self-serve and solve those problems on their own with a self-service platform.

01:27.95
Jake Beck
That's super interesting. I thought it was funny that you mentioned logs because, honestly, every time I go to look for Kubernetes logs, I'm like, oh God, where do I start?

01:34.40
Natan
Yeah, so it's weird because it depends on the organization. Either the logs are really accessible, but it's just a massive pain to copy-paste pod names, and the way you query them is not so accessible, or...

01:46.28
Jake Beck
Um, and.

01:51.35
Jake Beck
Ah.

01:52.93
Natan
The logs are super accessible, but the devs, by design, don't have access to them because it's production, and you want to give them access to stuff on their application but not to other things. So there's a wide variety of weird enterprise reasons why developers need that type of empowerment.

01:58.22
Jake Beck
Um.

02:10.83
Natan
The broad thing is, developers often don't have that Kubernetes expertise, and they just want to say, okay, where's my app? Here's all the stuff for my app. What do I do? Is there an issue? Yes or no, and then get back to development.

02:21.78
Jake Beck
They want to see the logs almost like they were doing local development, where it's just right there and it's like, okay, I know exactly what's wrong, without having to dig through all the Kubernetes stuff. If you're not in Kubernetes all the time, even using kubectl, it's like, okay, I have to refresh myself on all these commands and try 15 times before getting close to what I want. So.

02:45.68
Natan
So this is a good transition, maybe, to the topic of the podcast as well, and I'll give you an analogy. I mean, we always say "you build it, you run it," but I want to give an analogy that will sound insane...

02:57.53
Jake Beck
Yeah.

03:03.76
Natan
...outside the software industry. So imagine that you have an engineer, and he has some background in metalworking or whatever kind of engineering, and he works at a factory. He builds screws, and these are really fantastic industrial-grade screws that you use to screw into stuff, and they hold it tight. Now, someone takes those screws and assembles them into an airplane, and the screw is one part of that airplane, and then they put the engineer in the cockpit of that airplane, and they say, okay, you built it, you run it. But he didn't build the whole airplane. I mean, he built the screw, right? He's not a pilot.

03:35.37
Jake Beck
A.

03:39.32
Natan
And you sit him down in an airplane. There are all the panels and all the gadgets and the doodads and whatever, and you say, okay, fly the plane. And not just that, but you give it to him only when it's crashing. It's not even like it's mid-air and you're holding the wheel and driving with him. No, you give it to him right at the instant it's crashing, and you say, fly the plane and fix it. And that often is what's happening now with cloud native: you have developers who are really good at building applications, and they know Java, or they really knew C a year ago or whatever; they have really good application knowledge. But they're unfamiliar with the convoluted stack in the cloud. You're now not just running your application in Kubernetes, which I love. You're running your application, which is getting secrets from Vault, right? And there's probably an API gateway there, and there's often a service mesh, right? There are sidecars. And it's running inside a container, fine, Dockerfiles are really easy, and it's running inside a pod. Okay, you kind of get pods, and that's wrapped inside a deployment, and that's connected to a Kubernetes service, connected to the service mesh, and that's running on some node, and there could be a node issue, and then you get something like your HPA maxed out or some alert or something. I'm throwing out terms, but that's how it feels. And we say, okay, well, you build it, you run it, you're in charge of that, go and handle it. And where it often breaks down is people then go to the DevOps team, and they're like, okay, what do I do? And people want to be empowered. People want to...

04:56.30
Jake Beck
Ahead.

05:12.54
Natan
...own stuff. You want to have ownership. If you're that guy who built the screw and you get to be in the airplane and see what it's doing in action, that's really cool, and it's exciting to be part of the technology that went into building an airplane, and all the other stuff is really interesting to learn. But you don't want to get shoved into that cockpit when the thing's crashing and have to deal with all this stuff that you have no expertise in, with a sink-or-swim attitude.

05:38.40
Jake Beck
Yeah, no, I think that makes a lot of sense. Right off the bat I was just like, well, I disagree with you, and then you kept going, and I was like, okay, it makes sense, right? Especially if you think about it at scale: "you build it, you run it" really doesn't work at large scale. Maybe if you have a really small application with a really simple stack, that's fine, but like you said, as you continue to grow, you get into something that, as good and amazing as Kubernetes is, is extremely complicated to fully understand and have a breadth of knowledge in. It's not something you can easily do while also writing super good code all the time and continually running production. Writing production-level code while also trying to have a vast understanding of this tech stack that's running in some cloud environment. So I think you sold me on that.

06:37.55
Natan
So I'll tell you where "you build it, you run it" works: a ten-person startup or a twenty-person startup, where you have phenomenal engineers, and they get all the cloud stuff, and they built it, and you often don't have that extra layer of complexity.

06:50.47
Jake Beck
A.

06:52.30
Natan
Okay, maybe Vault for secrets, but you probably don't have an API gateway or a service mesh. You're not at that level of maturity yet anyway. So in these small startups, or in smaller companies, a hundred people, two hundred people, you have a really core team, and you're only hiring people who are really up to date, and you don't have a legacy application.

06:55.94
Jake Beck
Right.

07:10.53
Natan
You don't have a lot of other stuff that you're carrying with you. It's all a new stack, and everyone has known Kubernetes from day one, and in those places it kind of works. But if you look at the Fortune 500, who's doing "you build it, you run it"? Take Google, right? They don't do "you build it, you run it." They have SRE teams.

07:13.35
Jake Beck
Um, yeah.

07:28.17
Jake Beck
Um, you know.

07:29.26
Natan
They have one SRE for every ten developers, and the SREs are kind of running the operations, right? Sure, the developers are involved. I'm not saying we go back to what we did thirty years ago: build it, throw it over the wall, give it to IT, and IT runs it. I'm not saying you should do that either. But what's interesting is we're starting to see a movement emerge, a pushback, almost, to "you build it, you run it." People are starting to say: on one side, you have "build something, throw it over the wall to IT, IT owns it." There's no ownership, people have to do things they don't actually have the knowledge to do, and it doesn't work. On the opposite end of the spectrum, you have "you build it, you run it," and it's sink or swim. And if you look at the original idea of DevOps and how people thought about it, anyone who has the job title "DevOps" is not doing DevOps as it was originally envisioned. DevOps was supposed to be developers doing ops. So if your company has "DevOps" job titles, you're not doing DevOps as it was originally envisioned. Of course you're doing DevOps in the sense of what DevOps has become, and you're doing important work, and I would argue you're actually filling a really important role, but you're not doing "you build it, you run it." So the pushback is not really on "you build it, you run it," because no one's been doing it anyway. What the industry has done is...

08:50.28
Jake Beck
Um, and then.

08:59.83
Natan
...to push back on "you build it, you run it" as the whole solution by itself. Now it's under the banner of platform engineering, and people are essentially saying, okay, you just give it to IT? It doesn't work. You just tell a developer to sink or swim? It doesn't work. So: you build it, you run it, but you get enablement, you get support from these platform teams, and we care about developer experience. It's not "we shove everything in the world onto you."

09:29.69
Jake Beck
Yeah, no, I think that makes a ton of sense, especially once you start talking about the Fortune 500, right? So many of those companies have built out internal platforms to mask so much of what's in the background. A lot of times it's just Kubernetes, right? But they built these platforms where you go in and deploy your Docker image or whatever, and you have access to your logs and your metrics for your specific container, but you don't see what's going on in the background. And I think you said it perfectly there: we have the DevOps roles and everything, right? They aren't what was originally envisioned, and I think that's a perfect example. It's basically been folded into platform engineering as it's continued to emerge as its own thing in the industry.

10:21.35
Natan
So it's fun.

10:24.66
Jake Beck
And a question for you then: how are you doing it differently within Robusta? I'm very curious what you guys are doing differently in that aspect.

10:33.53
Natan
So we serve companies ranging from startups to very large enterprises running Kubernetes on-premise, or even governments. In companies like that, who today is supporting developers? They have a DevOps team that is constantly fielding support queries from developers. We give them a single pane of glass that they roll out to their developers, and then the developer goes in and just sees his application: he sees the logs, he sees the memory, he sees everything he needs to see, and he can take certain actions. He sees the information that's relevant for him, oriented around the way a developer thinks. Meaning, if you look at the way ops thinks, you have almost different subsystems, right? Say you have a pod that's also scaled by the HPA, and it's running on a node somewhere, and then there are logs. All those things would be different dashboards, because they're different subsystems; it's a different graph, a different dashboard template, whatever. But if you look at the way the developer thinks, he's not thinking about subsystems. It's just "my application." He wants to see everything related to his application on one page. So what we do is...

11:35.10
Jake Beck
Um, and.

11:48.69
Natan
We give a platform on top of your existing observability data, built with tight integrations on top of Prometheus and VictoriaMetrics, anything that's Prometheus-like. Each developer can go in and just see a page with the stuff that matters for him: firing alerts, issues, and things that need his attention, and other problems like cost inefficiencies that he should deal with, all made simple in a way that a developer who doesn't have all the expertise can really understand. So conceptually it's almost like an internal developer platform, but really oriented around what goes on day-to-day in production, the stuff platform engineers and DevOps teams are fielding support for, rather than oriented around day negative one: how am I going to scaffold a new application?

12:35.40
Jake Beck
Okay, that's really interesting. So it's definitely not geared toward the platform engineer specifically; it's more geared toward the people writing whatever that cloud app is, so they can see it in a way that makes sense without having to really know Kubernetes. I would assume it's more about abstracting the confusion and the complexity that Kubernetes really is.

12:53.86
Natan
Exactly.

13:02.94
Natan
But with two distinctions. One: you don't have to redeploy your software via our platform. So we're not like some of the stuff out there in this space that says, okay, we're giving you a platform, now you deploy everything our way. We don't do that; we just look at what's already out there.

13:12.16
Jake Beck
Um.

13:22.64
Natan
And then we give you a portal on top of that with zero configuration, because if you think about it, the data is actually already out there; all the services are already running. You shouldn't have to change all your YAML files and deploy something with some weird CRD. You would never do that, and why should you? All the stuff is already out there. It's being deployed...

13:28.50
Jake Beck
Um, right.

13:40.50
Natan
Maybe you're deploying it via Helm chart, whatever, but we're able to look at what's already there and say, okay, this is one app, this is another app, this is another app; here, show that person the stuff he cares about. And then the second distinction is that we're brought in for developers, but we're almost always brought in by platform teams, and a big part of the value we bring them, the way this is often justified in organizations, is that we can dramatically reduce the number of tickets they're getting from developers. We see reductions of around 50% in the support tickets they have to go and field.

14:18.64
Jake Beck
No, I think that's really interesting. It's brought in essentially by the platform engineers to make their jobs easier by making the developers' jobs easier, kind of, right? Like a little bit.

14:29.96
Natan
Well, yeah, it's interesting because there are two major stakeholders, and you're kind of building for both of them. You're building for developers, but you're also building for the platform teams, because they're the ones who are giving the budget, and they're the ones who actually...

14:37.64
Jake Beck
Is it?

14:47.31
Natan
...are benefiting most day-to-day, because now they free up all this time.

14:51.43
Jake Beck
Yeah, for sure. I guess, kind of on that topic of monitoring: your platform's built around essentially monitoring everything. What do you think are some of the best practices that you help implement that a lot of people might not be doing, or might be doing incorrectly, right now?

15:06.98
Natan
So the first one I would say is to be application-centric. No matter how you do it, no matter the technology, you want to give application developers something that's oriented around what they think of and what they care about. They should not have to go to one dashboard to see certain aspects of their application running in the cloud and then go to another dashboard to see a different one. A good example: you typically go to one dashboard to see stuff on the HPA, right, and how the app is being scaled, but then there's a different dashboard for your pods, and so on. All those things are part of one holistic picture: this is my app, here's its YAML, here's when it was last changed and rolled out, here's what's going on right now, here are the alerts that fired on it. Give a single pane of glass for what an application developer actually cares about, because the application developer does not know every dashboard that's out there, or where to go and where to look. The data is all there; it just isn't organized in a way that promotes troubleshooting for developers. So that's the first tip, and then...

16:12.69
Jake Beck
And.

16:18.28
Natan
The second tip is to try to drive people to action. Everything you're showing should drive them to action. An action can be going back and changing something in Git, or, say there's something stuck with some transient issue and deleting the pod fixes it; maybe that's common on Kubernetes, but you can have actions like that. And you can have actions like: I wrong-sized my resource, I overallocated it by 200%, I should downsize it. So think about how you build something that drives people to action and doesn't just throw up a wall of information that people don't always know what to do with, because when people are confused, they do nothing. Drive people to action. I could go on, but I think those are the two main ones.
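The right-sizing action Natan mentions (spot a 200% overallocation and suggest a downsize) reduces to simple arithmetic. A minimal sketch, with a hypothetical 30% headroom factor:

```python
def rightsizing_advice(request_mib: float, actual_use_mib: float, headroom: float = 1.3):
    """Suggest a memory request based on observed usage plus headroom."""
    recommended = actual_use_mib * headroom
    overalloc_pct = (request_mib - actual_use_mib) / actual_use_mib * 100
    if request_mib > recommended:
        return (f"downsize: request {request_mib:.0f}MiB is "
                f"{overalloc_pct:.0f}% above usage; try {recommended:.0f}MiB")
    return "request looks reasonable"

# A pod using 100MiB but requesting 300MiB is overallocated by 200%:
print(rightsizing_advice(300, 100))
```

The key design point is that the output is a concrete next step ("try 130MiB"), not just a chart of usage.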

17:02.97
Jake Beck
Yeah, I love that, the action one especially, because we've all had annoying alerts that we're like, ah, that one flaps all the time, we don't need to do anything about it. That's such bad practice, and it's so common, because if the alert's just flapping and you ignore it, you should either fix the alert, or maybe it doesn't need to exist at all and you should act on that, right? So I think even a flappy alert like that is an action item: you should do something about it.

17:35.71
Natan
Yes.

17:39.13
Jake Beck
And so many of us are just like, we have better things to do, whatever it is, and we'll just learn to ignore it, and then when something bad actually happens, you're just so used to ignoring it. So I love what you say about the action thing. And I love controversial takes, so let's hear it.

17:51.34
Natan
Um, let me say something controversial. Okay, I love GitOps. I think the philosophy behind GitOps is fantastic. I think it's great that you have an audit log of everything that can happen and did happen.

17:59.00
Jake Beck
A.

18:06.18
Natan
I think it's great that you can roll back. I think it's great that you can review changes with a pull request on GitHub. GitOps is phenomenal. It's a terrible fit, though, for handling some aspects of alerting and thresholds, because it does not drive you to action. You have an alert, and you get it in Slack, and you say, it's that same alert, another day. It has fired every single day since I've been at the company, it's been firing, and no one has ever gone in and silenced it, and I don't even know which Git repo I'd have to go modify to change that. That does not drive action, and that is how everyone is deploying their Prometheus alerts. And it's a terrible way of doing it. If you want to manage the templates for those alerts in Git, fantastic. But when it comes to thresholds and silencing, stuff like that is not something that makes sense to handle in Git. And yeah, I know that's a controversial take, and I know many people would disagree with me on that. And product-wise, we do now help with managing Prometheus alerts, and for those applications we're taking a non-GitOps approach on day one, and then we'll be adding a GitOps-compatible approach on day two, because people demand it; no one's going to change how they work because of me. But we see the trouble it causes when people have those alerts: it causes a ton of noise, and people do not go in and silence them. And it's even worse when you look again at how companies actually work with "you build it, you run it." You have a DevOps engineer or a platform engineer who defines some of the infrastructure alerts...

19:42.88
Natan
Like a default alert from the kube-prometheus-stack that's running and firing, and the developer whose application it's firing on often has no clue what you would even do to make that alert stop coming. And the way the responsibilities are set up, the permissions they'd even need, like which repository you'd have to go and edit: the entire pattern is built toward the eventual outcome that developers ignore all the Prometheus alerts they get, and that doesn't work really well.

20:12.53
Jake Beck
Right? So what do you recommend for maintaining those, then? Because the benefit of GitOps, right, is that anyone can go in and see it; it's saved in GitHub, GitLab, whatever you use. What's your approach instead of GitOps for that?

20:29.87
Natan
So, where we want to be, and we're getting there, we're almost there, is this: you manage the templates in Git, and the thresholds and the overrides for specific applications you have the option of managing in Git, but you also have the option of just clicking a button and having an alert disabled. In an ideal world, you click a button to disable it, and it disables it for 24 hours, but it also opens a PR on the Git repo to make that change. And there's a really interesting write-back question in the whole Kubernetes world with everything GitOps-related: you see a problem occurring in some Kubernetes infrastructure, and then going backward and figuring out what to change in Git is non-trivial. In fact, it used to be somewhat intractable. What I mean by that, I'll give you a trivial example. You have a deployment... sorry, you have a Helm chart, and in that Helm chart is some value for, say, a memory request, right? That's how much memory the pod needs. If you get the number wrong, the pod crashes, so it's an important number that you have in that Git repository...
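The split Natan is describing, templates and defaults in Git with thresholds and temporary silences applied as overrides outside it, can be sketched as a simple merge. Everything here (field names, the snooze map) is hypothetical, not Robusta's implementation:

```python
import time

# Hypothetical sketch: base alert templates live in Git; per-app threshold
# overrides and one-click 24h snoozes live outside Git and are merged in.

BASE_ALERT = {"name": "HighMemory", "threshold_pct": 90, "severity": "warning"}

def effective_alert(base, overrides, snoozes, app, now):
    alert = {**base, **overrides.get(app, {})}   # per-app threshold override
    until = snoozes.get((app, base["name"]), 0)
    alert["enabled"] = now >= until              # snoozed until a timestamp
    return alert

overrides = {"checkout": {"threshold_pct": 75}}
snoozes = {("billing", "HighMemory"): time.time() + 24 * 3600}

a = effective_alert(BASE_ALERT, overrides, snoozes, "checkout", time.time())
b = effective_alert(BASE_ALERT, overrides, snoozes, "billing", time.time())
```

In the ideal flow Natan sketches, writing the snooze would also open a PR so Git eventually catches up, but the click takes effect immediately.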

21:36.20
Jake Beck
Um, yep.

21:44.47
Natan
...a number for a memory request. That's one of the Helm values, and then Helm has a bunch of templates and runs all those templates, and somewhere in the output is a deployment file, and that runs on Kubernetes and creates a ReplicaSet, and that creates a pod, and then the pod starts, and the pod is eventually killed because that initial number in Helm is wrong. You can go back from that pod and find the deployment really easily, and you can know that that deployment was created by Helm, but it's actually probably a non-computable problem to take that thing that was templated out and go back and find the value in the template. I mean, it's not exactly non-computable; it's potentially non-computable in the generic case, because reversing that templating step is very hard. You can do it...
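The forward chain is easy (values, templates, Deployment, ReplicaSet, Pod); the write-back direction is the hard part. A naive sketch of the reverse search, which just hunts the values tree for the rendered literal, shows why: any templating that transforms the value defeats it, and identical literals are ambiguous. The data here is made up:

```python
# Naive write-back search: given a value seen in the rendered deployment
# (e.g. a memory request), find which path(s) in values.yaml supplied it.
# Only works when the value passes through templating unchanged.

def find_value_paths(values, target, path=()):
    hits = []
    if values == target:
        hits.append(".".join(path))
    elif isinstance(values, dict):
        for k, v in values.items():
            hits += find_value_paths(v, target, path + (k,))
    elif isinstance(values, list):
        for i, v in enumerate(values):
            hits += find_value_paths(v, target, path + (str(i),))
    return hits

values_yaml = {"app": {"resources": {"requests": {"memory": "64Mi"}}},
               "sidecar": {"memory": "64Mi"}}
paths = find_value_paths(values_yaml, "64Mi")
# Two candidate paths come back: that ambiguity is the heart of the problem.
```

Which of the two paths actually produced the crashing pod's request is exactly the question Natan suggests handing to an AI model along with the templates.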

22:32.70
Jake Beck
Ah, I've done it. Yeah, I just have flashbacks of having to do this as you're walking through it. I'm like, oh no.

22:35.96
Natan
Um, but did you do it with code? Doing it as a human is easy. Did you do it with code, though? Like, have code that can generically reverse it? I should be careful here saying it's not computable; there are actually some approaches, and we could touch on that, but it's probably super technical, maybe a little too technical for this talk. Doing it as a human, we do it all the time, everyone does it all the time, but having software that's opening a PR and then going back and figuring out which line that is...

22:59.10
Jake Beck
Right.

23:10.16
Natan
I haven't proven it, I haven't taken the time to prove it, but it's potentially non-computable. You get into Turing completeness and the halting problem; it's potentially something that's not computable from a theoretical perspective, I mean.

23:24.70
Jake Beck
Um, yeah.

23:26.80
Natan
And there are approaches: if you add a Helm plugin, you add line annotations, there are probably approaches that can tackle it. But the cool thing is that you can actually solve it trivially now with AI: you just feed in the values and the templates and ask which one of these is responsible, and that actually makes it a really trivial problem.

23:45.67
Jake Beck
That's super interesting, the AI approach. I guess you guys are taking advantage of something similar with that AI, right?

23:51.54
Natan
Yeah, we rolled out a feature that uses AI to do what a DevOps engineer would do, or what a great engineer who knows Kubernetes would do, what a great developer who knows Kubernetes would do. The way it works is this. You open up your application on Kubernetes, so imagine opening it up in the browser, where any developer who has permission can access it. A developer opens it up and sees his application, and there's some error in this application. It could be that a pod is pending. It could be that his application is unhealthy because it has a firing Prometheus alert. It could be a bunch of stuff, right? The application has some issue, and Robusta shows you, okay, it's red, so that's bad, right? And then it gives you all the information on that page, and there's some stuff that we did pre-AI to automatically gather the right data and surface it and show it to you. But the cool thing is something we added recently, and I've had some really good feedback on this: an AI investigation button. You just hit that button, and it gathers up a whole bunch of data. It gathers up the Kubernetes YAML for that pod and related resources; it gathers up the Kubernetes events, like kubectl get events, and then filters out the stuff that's relevant; it gathers up the logs; it gathers up whether there were liveness probes failing, though that's actually in events, I think. It gathers up all these different data, and then it feeds them into an AI model. We built it to be pluggable, so you can bring a private tenant, or by default we just use Azure with our own instance, I mean.

25:26.31
Natan
And then it goes and spits out for you: what is the problem, plus the data it relied on to reach that conclusion, pulling out all that data so you can see that it's not hallucinating, and what we recommend doing to fix it. The final thing that would make it perfect is then writing that back to Git.

25:39.65
Jake Beck
Um, that's... that's super cool.

25:45.31
Natan
Which we don't have yet, but I think with AI it's actually doable now.

25:47.12
Jake Beck
Now that's super cool, just using it to speed up that investigation process, even if it requires some intervention from a human, right? Even just gathering all that data is super valuable in one specific area, because it's not always going to be right. It's AI; there's never a hundred percent accuracy with AI. But having, like you said, the "here's the data we made this decision from," so you can go validate it, that's huge for so many people, just to speed up that loop of debugging those issues.

26:20.16
Natan
So a colleague of mine in the UK, at another company, a consultancy, took this and was playing around with it. They got it hooked up with tracing too, I think, and they did a bunch of stuff, and in one of the demos they gave at a conference recently, they showed what happens when you break something. You have microservice A that's speaking to microservice B, and you break B, and then A starts malfunctioning even though nothing changed on A. One of the things we feed into the AI is a list of all recent changes in the cluster, and with that, the AI was able to say: this broke because of this change in a completely different microservice. If you're the developer looking at microservice A, you often don't even know that B changed. So that's another cool example where, I almost want to say it doesn't matter if it hallucinates, but really, we're not asking questions that are hallucination-prone.

27:16.82
Natan
But one of the interesting things we found is that if you ask, "Is there a problem?", we actually got really bad results. We had a very high hallucination rate: there was always a problem. Whatever you were feeding in, it would always find an error. Current models, state-of-the-art models, had a really hard time understanding what is a problem that a human would interpret as a problem, versus what a human would glance at and say, that's not interesting. But if there is a known problem, like a firing alert or an unhealthy pod status, then asking "what is the other data I should be looking at," feeding in a bunch of data, and having it pick out the right pieces? That's not very hallucination-prone.
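
The distinction Natan draws can be made concrete as two prompt styles: an open-ended "is there a problem?" question, which invites hallucination, versus a prompt anchored on a known failure signal, which only asks the model to select and explain relevant data. A hedged sketch, with hypothetical function names:

```python
# Illustrative contrast between the two prompt styles discussed.
# These helpers are hypothetical, not any particular product's API.

def open_ended_prompt(cluster_data):
    # Hallucination-prone: the model will almost always "find" a problem,
    # even in a perfectly healthy cluster.
    return f"Here is cluster data:\n{cluster_data}\nIs there a problem?"

def anchored_prompt(alert, cluster_data):
    # Safer: the problem is already established by the firing alert;
    # the model only picks out which pieces of data explain it.
    return (
        f"Alert firing: {alert}\n"
        f"Here is cluster data:\n{cluster_data}\n"
        "Which of this data is relevant to the alert, and why?"
    )
```

In the anchored version, deciding *whether* something is wrong stays with deterministic signals (Prometheus alerts, pod status); the model is only used for the selection-and-explanation step.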

27:56.29
Jake Beck
Yeah, that makes a lot of sense, right? You could say that with anything: "Is there a problem?" It's always going to say yes, because there's always probably some problem, but is it a real problem? Maybe not. And that's where, like you're saying, feeding it the right data lets it actually discover, I guess, a failure point instead of a "problem," because a problem can be super vague. Versus if you just ask whether there was a problem, even if there wasn't one, it's still just going to say yes. So that's awesome.

28:25.99
Natan
Yeah, and the industry term for this is RAG, or retrieval-augmented generation. The real difference is you're not asking the model, "Is there a problem, and what is it?" You're saying: here's a problem, and here's a ton of context, as much as we can pack into the context window,

28:33.90
Jake Beck
Um, okay.

28:45.25
Natan
like the length of the prompt: here's all the context, now pick out the relevant stuff. And by giving it all that extra context, it's not relying on stuff it knows from training time; it's working with stuff it knows because you just fed it in, right now, in the prompt. You get much, much more accurate results, relevant to what's actually going on in your environment.
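
The "pack as much as fits in the context window" idea can be sketched as a simple greedy packer that trims the retrieved snippets to a budget before they go into the prompt. The budget and names are illustrative assumptions, not a real library's API:

```python
# Minimal sketch of the RAG packing step described above: instead of relying
# on what the model learned at training time, pack fresh environment data
# into the prompt, trimmed to a context budget. Everything here is illustrative.

def pack_context(snippets, max_chars=200):
    """Greedily add context snippets, in priority order, until the budget is hit."""
    packed, used = [], 0
    for s in snippets:
        if used + len(s) > max_chars:
            break  # stop at the first snippet that would overflow the budget
        packed.append(s)
        used += len(s)
    return "\n".join(packed)

# Highest-priority evidence first; the oversized third snippet gets dropped.
snippets = [
    "alert: KubePodCrashLooping on checkout-5d9",
    "events: BackOff x12",
    "full pod logs: " + "x" * 500,
]
ctx = pack_context(snippets, max_chars=100)
```

Real systems measure the budget in model tokens rather than characters, but the principle, prioritized evidence trimmed to the context window, is the same.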

29:02.50
Jake Beck
Right? That's super interesting. I think the future with AI is pretty bright in a lot of areas, especially with something like that. There are definitely areas where it's not so bright, I think, and that's a whole different conversation, but it seems like you're really taking advantage of it and using it the way it's really meant to be used. So congrats to you guys for getting on board early, it seems.

29:31.90
Natan
Well, thank you. It's weird, because it happened all backwards. When I was a kid, they said that AI would eventually replace programmers, and we were like, okay, but maybe there's still good programming work for now. But they essentially said AI would be really good at logic, and that's how we all thought of it, right? Think of the caricature of a robot in science fiction: it's super logical, and it doesn't understand any common sense.

30:09.67
Natan
And, of course, it's terrible at creativity. And somehow it all happened exactly the other way around. We got stuff that's really creative but not logical, that will lie, and will lie without realizing it. We got all the imprecise stuff before we got the hard science and the facts.

30:10.15
Jake Beck
Um, yeah.

30:29.35
Natan
I think it just goes to show that, um, AI is more like humans than we tend to admit.

30:34.70
Jake Beck
Yeah, no, I totally agree. It's almost scary: it makes similar mistakes to humans, and it learns in a much different way than humans do, obviously, but it often ends up with the same results. It's kind of funny to think about it that way. Well, I think we're winding down on time, so if you want to let people know where they can find you?

30:51.41
Natan
Um, yeah.

31:03.98
Natan
Um, so I'm on LinkedIn, Natan Yellin, and on Twitter; my handle is like my first name, but I must have been a little bit dyslexic in high school. Please feel free to reach out. I love hearing from people. A lot of times you record a podcast, it goes out, and then either you hear from people or no one ever writes anything, and you don't even know if people actually listened to it. So it's always really nice hearing from people, and if people have questions...

31:32.14
Jake Beck
Um, yeah.

31:37.18
Natan
...or are interested in learning about the stuff we spoke about today, have opinions, or are doing stuff like that internally, I always like to chat about their approaches and hear what approaches other companies are taking, from AI troubleshooting to the stuff around writing back to Git. I also like hearing how other people think about encouraging devs to take action without it being too spammy.

31:55.91
Jake Beck
Ah, yeah, awesome. Well, thank you for your time.

32:05.81
Natan
It's been a pleasure. Thank you.

Featured Episodes

Podcast

S3 Ep8: Cloud Trends Unveiled: Navigating the Future of Cloud-Native Platforms & Gateways

Krishna gave us a sneak peek into the evolving landscape of cloud technology and Cisco's role in it, drawing parallels between Cisco's strategy and Nvidia's positioning as ‘the AI computing company,’ stating that Cisco is aiming to position itself as the cloud-native service provider.


Podcast

S3 Ep9: Developer Productivity: the Inner Dev Loop & Quantitative Metrics

Explore the crucial role of the inner dev loop and quantitative metrics in enhancing developer productivity and building a sustainable engineering culture with Guru Kamat, Co-Founder of Neurelo. He has taken the developer productivity focus to the next level, helping to build Neurelo's Cloud Data API platform, which is infused with the power of AI automation and designed to transform the way developers interact with their databases. Here are Guru's main takeaways to help any tech leader not just boost developer productivity but build a long-lasting, sustainable developer team that stands the test of time.