K007: “Monitoring” For Fault Tolerant Modern Infrastructure
transcript
Summary Keywords
monitoring, infrastructure, managed service, data, monitoring tool, configuring, system, events, performance, alerts, cloud provider, connectivity, managed, observability.
Kamalika Majumder : Hey everyone.
Welcome back to cloudkata, The Modern Infrastructure Show. This is season one, Anatomy of Modern Infrastructure, and today's episode is about the 6th factor: Monitoring. If you have not yet caught up with this series, please subscribe to cloudkata on Spotify, Google Podcasts, Apple Podcasts and Stitcher. Or you can visit cloudkata.com and subscribe to the show to learn about the first five factors, which we covered in the previous episodes. So let us begin this journey of modern infrastructure with today's episode about monitoring.
Intro Music Plays
Hey, everyone, welcome back to cloudkata, The Modern Infrastructure Show. This is season one, Anatomy of Modern Infrastructure. After a long gap since last year, I am back today to continue this journey of modern infrastructure.
And I would like to begin today's episode on monitoring with a very funny story. It was around 10 years back, when I had just started my system administrator career. One fine day, our boss gave us a call while we were still working. He asks us, how's everything? So we say, yeah, it's all fine. He's like, how are things on the infrastructure side? You know, in those days it was all server rooms and so on. Is everything working fine? Is there any issue? We say, no, everything is fine, everything is working fine. So he's like, are you sure? Because I'm not getting any alerts or notifications. To that we respond, that's exactly why we are saying that everything is fine. That's why we're not getting any alerts. It actually took us around 5 to 10 minutes to convince him that the reason he was not getting any alerts or notifications was that everything was working fine.
That was that particular incident. Today, quite a few years later, the situation has remained the same: it takes us quite some time to convince the people at the executive level that everything is working fine, unless we are able to showcase it to them. And that's because we lack observability in the existing setup. So let us begin our journey for today's episode by learning how to build that observability.
So today, in this episode, we are going to learn how end-to-end monitoring can reduce infrastructure downtimes and issues by identifying fault lines and resolving bottlenecks proactively. And we are also going to understand how centralized monitoring can ensure that everybody, right from an engineer to the CXOs, is on the same page when it comes to the state of infrastructure. So let us first go through today's agenda. First, we will cover what centralized monitoring is and how we can achieve it. Then I'm going to talk about how to categorize the monitoring data that is available, how that data should be structured in infrastructure monitoring and application monitoring, and how you actually choose your monitoring tool or tools. Once you have done that, we will look at how to set up alerts so that you get notified or warned in advance about anything that might be happening in your infrastructure. And finally, I will close this episode with some key thoughts and key things that our operations or finance teams need to consider when it comes to using these monitoring tools or the monitoring setup.
So let's get started with monitoring for building observability in your infrastructure.
So let's start with centralized monitoring. What is centralized monitoring? In my view, the ideal design for monitoring is a one-stop station for all kinds of metrics and analytics. It is a centralized dashboard which captures the state of infrastructure resources like networking devices, systems, servers, anything that relates to your infra. Then comes your application: your front-end and back-end services, the apps that you are building, the state of back-end services, your deployment state, your health checks, performance, and even how your mobile application is performing. And finally, it captures all the data that is necessary for your business, like user activities, transactions, etc. So the ideal monitoring setup will capture all of these in a centralized dashboard and present to you the state of your infrastructure. It should be a one-stop place where you just go and see what is happening. And even if there is an issue, you can track the issue right from where it happened and trace it back to its roots, so that you can solve it as soon as possible. So, how do we achieve that?
So first of all, before you build it, you need to make sure that you categorize the monitoring data. Now, how do you categorize monitoring data? There are two ways. First is dividing it into what to monitor, and then how to monitor. When you ask the question of what to monitor, you would like to monitor infrastructure, applications, services, anything that is running your services. As for how to monitor it, you need to monitor the state, performance and events of all these things: infra, apps, services.
So let's talk about infrastructure monitoring first. The three things to monitor in any kind of monitoring system are, as I mentioned, state, performance and events. When you're monitoring your infrastructure state, what falls under state? It will include your health checks and the uptime and downtime of infrastructure, as in, are your servers and networking devices all up and running, or were there any glitches or any unplanned downtime? You'd like to capture that. Then you'd like to capture the availability of your data centers. Since we are talking about modern infrastructure or cloud native solutions, when you are on cloud you might want to monitor the availability zones, which are equivalent to the legacy data centers; technically they are the data centers which the cloud providers use to host the hardware. So you want to monitor the availability zones, because nobody gives you 100% uptime, let's be very honest. You need to know whether your availability zone was up and running all the time, or how many times it went down. Did you actually get the service availability that the cloud provider promised you? So you would want to monitor the availability of your infrastructure. Then the fourth one is connectivity.
That means, in this age of cloud, you might end up having a hybrid cloud setup: you might have your own solutions, and you might want to integrate with some third party solution. Let's say you are a bank, for example. Although you are developing the banking applications or websites, you would have your payment switching gateway from a third party. So you would want to monitor whether the connectivity from your application to that switching gateway is working fine or not. That is connectivity across your third party partners or integration points, and also internally, meaning whether the connectivity from your app servers to the database systems or to the messaging systems is working fine or not. So health check, uptime and downtime, availability of your data centers or availability zones, and connectivity: four important things to monitor when it comes to the state of infrastructure.
Then the second grouping is performance. Once you have started monitoring the state of infrastructure, you have already gone one step ahead to make sure that your infrastructure is always available, and if it is having some glitches you would want to fix them. The second thing would be performance. Now, what comes under performance? Performance includes things like your CPU, memory and disk utilization. You might want to monitor how your app servers, virtual servers or other systems are doing when it comes to peak load. Let's say you are going for a big launch, or you have a particular event of user registrations, or, if you're in retail, you are launching a sale for a Black Friday event. You will want to monitor whether you are ready for that kind of activity or not.
You gain that confidence once you have collected CPU, memory and disk utilization for the application under that kind of load. This data is also very important when you are running load testing or performance testing for the kind of traffic that you are expecting. The second thing under performance is network bandwidth. In the same way, when it comes to checking whether you are ready for the performance that you want to deliver, you need to monitor the network bandwidth, the upload and download speed, and the latency between your systems and your third parties, and also across your own systems. Then the third thing is peak hour traffic. When you are sending and receiving data, you want to know when the traffic is picking up. Are there any spikes? Do you need to upgrade your bandwidth? So these are the three things that you would want to monitor under performance.
Then the third aspect of infrastructure monitoring, or any kind of monitoring, is events. What kind of events would you want to monitor? Important security events like authentication failures: are there too many failed logins? Or even successful logins, because nowadays people work with a zero trust policy, so they would also want to know about the successful logins. You want to monitor whether any unauthorized access has happened.
Failed attempts as well. You would also want to monitor the change management system. Let's say you're deploying a new version of the application: you would want to know if any firewall policies or ports have been added, and when that happened. This is very important, because unless you know what happened, you will not be able to arrive at a conclusion, whether you are debugging or preparing a report. Then you want to monitor how many deployments you have done, to see the frequency of your deployments, or how long your deployments took. Those kinds of events. So these are some of the things that you would want to monitor when it comes to infrastructure monitoring.
Let me do a quick recap. There are three groups: state, performance, events. Under state you can monitor health, uptime and downtime, availability, and connectivity. Under performance you can monitor CPU, memory, disk, network bandwidth, and peak hour traffic. Under events you can monitor any critical events: authentication failures or unauthorized access, and any kind of change events or configuration changes, like firewall rules or deployments. All of this comprises what you would want to monitor in your infrastructure.
Now, let's see what you should be monitoring in your application. In the same way, state, performance and events are the three major categories into which we will divide any monitoring data. It becomes very easy if you are able to group it this way. Otherwise, you will end up having one big pool of data, and when a specific event occurs, let's say something is down, you will scramble around all the data that you have. So it is very important that you group your monitoring data, and I find it very useful to group it in three categories: state, performance and events.
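The grouping above can be sketched in Python. This is only an illustration under stated assumptions: the metric names and the `CATALOG` mapping are hypothetical and do not come from any particular monitoring tool.

```python
from collections import defaultdict

# Hypothetical catalog: metric name -> (layer, category).
# The categories follow the three groups from the episode:
# state, performance and events.
CATALOG = {
    "health_check": ("infra", "state"),
    "uptime": ("infra", "state"),
    "az_availability": ("infra", "state"),
    "connectivity": ("infra", "state"),
    "cpu_utilization": ("infra", "performance"),
    "memory_utilization": ("infra", "performance"),
    "disk_utilization": ("infra", "performance"),
    "network_bandwidth": ("infra", "performance"),
    "peak_hour_traffic": ("infra", "performance"),
    "auth_failure": ("infra", "events"),
    "firewall_rule_change": ("infra", "events"),
    "deployment": ("infra", "events"),
}


def group_metrics(names):
    """Group metric names by (layer, category) instead of one big pool."""
    grouped = defaultdict(list)
    for name in names:
        # Anything not in the catalog is surfaced explicitly, not dropped.
        key = CATALOG.get(name, ("unknown", "uncategorized"))
        grouped[key].append(name)
    return dict(grouped)


if __name__ == "__main__":
    incoming = ["uptime", "cpu_utilization", "deployment", "mystery_metric"]
    for key, metrics in group_metrics(incoming).items():
        print(key, metrics)
```

Keeping an explicit "uncategorized" bucket is one way to notice data that nobody has decided how to monitor yet.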
So, what comes under application monitoring? Under state, as I mentioned earlier for infrastructure, you can monitor connectivity with your third party partners or integration points. Likewise, you can monitor it in the application. In fact, when you're monitoring in a centralized system, the infrastructure may have failed first, but the application will also show you the impact of that infrastructure failure. Sometimes it may not be an infrastructure issue at all: your network and your third party partner's network might be working fine, but there might be something in the system on the third party's side or on your side which is causing the failure. So it is very important that you set up monitoring on the integration itself. That means, if your application is connecting to the third party through HTTPS or some other TCP connection, you make sure that you monitor that interaction. So first, when connecting with something which is outside of your network, some third party system, you need to monitor that connectivity.
Second, you need to monitor the integration points within your application as well as outside. When you are building your application, you might have a lot of components. Apart from the application that you are writing, you will have your database systems, your message queues, your caching systems, in-memory systems, etc. So you need to make sure that the integration across your application is working fine, meaning your application is able to connect through all the flows that it needs and the functionalities are working. The third thing is that the front end to back end flow needs to be monitored. Then the fourth thing, and it is quite important, is that you have a true path for fault identification.
So, what is true path? Some monitoring tools call it true path or pure path. It means you go right from where the event has happened and trace it back to its root cause. Sometimes you will not know upfront whether it is a database issue, an application issue, or some other system failure. So what you do is, you get notified about an error or some failure, and you trace it right back to where it first occurred. This is a form of debugging, and many advanced tools provide it. It is very important to make sure that your monitoring tool or monitoring system is able to give you the entire path. Otherwise you will be hopping from one system to another, taking wild guesses about what actually failed and where the issue is. So these are the things that you can monitor under the state of the application: connectivity, integration, front end to back end flow, and true path or pure path for fault identification.
Okay, now what comes under performance? One thing is crash analytics for your mobile applications. Another is monitoring your performance tests. When you're running your performance tests or regression tests, you need to know how your application is doing. The same way you would monitor your infrastructure, you will also want to know whether your application is able to take the load. Let's say you want to onboard 500 users simultaneously. Is your application, especially if you're developing a mobile application, able to handle that kind of load? So you need to monitor performance.
And it's not just while you're running the tests; you also need to monitor production, the live system. Because if your application crashes and you are not able to meet the numbers, you might not comply with the promises that you have made to your customer. So it is very important that you monitor the performance of your systems, the performance of your applications, and the crash analytics and synthetics of your mobile application.
Then the third category is events. What comes under events? As I mentioned earlier: user activities, like onboarding, registration, and how users are moving from one place to another. You would want to monitor sessions, meaning how long users are actually logged in. Then transactions: how many transactions have happened, how many downloads have happened. And business analytics: you would want to monitor business metrics, user behavior, and how many conversions you have. All of these are the events that you will have to monitor under application monitoring.
So in short, like we saw in infrastructure, in the application you can monitor a whole lot of things. Under state, you can monitor connectivity, integration, the end-to-end flow, and true path or pure path. Under performance, you can monitor crash analytics, and you can set up monitoring for performance pipelines or regression tests. And under events, it will typically be monitoring your business flows, business behaviors, user activities, transactions, downloads, and how fast your application is moving. All these will help you gather enough observability in your system and in your service or product, if you're building one, and make sure that you have an answer to the issues that might happen, or that you can prevent them proactively and not wait for them to happen.
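The idea of tracing an error back to where it first occurred can be sketched as follows. This is a simplified illustration, not how any specific tool implements its true path or pure path feature; the `Span` record and its `caused_by` link are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One failed step in a request. All fields are hypothetical."""
    span_id: str
    service: str
    error: str
    caused_by: Optional[str] = None  # id of the downstream span that failed first


def trace_to_root(spans: Dict[str, Span], start_id: str) -> List[Span]:
    """Follow caused_by links from the visible failure down to the root cause."""
    path: List[Span] = []
    seen = set()
    current = spans.get(start_id)
    # The seen set guards against accidental cycles in the recorded data.
    while current is not None and current.span_id not in seen:
        seen.add(current.span_id)
        path.append(current)
        current = spans.get(current.caused_by) if current.caused_by else None
    return path


if __name__ == "__main__":
    spans = {
        "s1": Span("s1", "mobile-app", "registration failed", caused_by="s2"),
        "s2": Span("s2", "backend-api", "HTTP 500", caused_by="s3"),
        "s3": Span("s3", "database", "connection pool exhausted"),
    }
    path = trace_to_root(spans, "s1")
    print(" -> ".join(s.service for s in path))
    print("root cause:", path[-1].service, "-", path[-1].error)
```

The last element of the returned path is the first place the failure occurred, which is what saves you from hopping between systems and guessing.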
So, we saw that there are a lot of things that you will have to monitor across infrastructure and application. So it becomes very important to understand what kind of monitoring tool we are choosing. In the market, you will get one tool for each of these things. Now, as I mentioned before, it is very important that you have a one-stop station. I'm not saying that you will get one tool where you can monitor everything, because most of the time these tools are pretty expensive. But even if you are going with open source software, you need to make sure that the final dashboard that you have for monitoring is all consolidated in one place. Otherwise, you will waste your time going from one tool to another.
So how do you choose that monitoring tool, or what is the best monitoring tool that you can choose? The first thing to keep in mind is whether you are going with a managed service or a self-managed service. To simplify, managed services are mostly the services which are provided by the cloud providers themselves, if you are on cloud. Most of the time they have some out-of-the-box solutions, where the monitoring system is provided by the cloud provider by default for some basic monitoring. But if you are going for advanced monitoring, like the state, performance and events I mentioned, you will need to pay for it. These will be completely managed by the cloud providers. Managed services are also provided by some SaaS vendors: you have monitoring tools which are hosted by the vendors, and you can just log in and connect to them. So these are the managed tools, which are completely maintained by the provider.
And you just take care of configuring it or integrating with it. The second type of tool is the self-managed or open source tool. These are free, of course, but you will have to set them up and maintain them yourself.
So what are the pros and cons on both sides? Let's look at managed tools first. Managed tools are very beneficial if you have a very big setup and you have to monitor a lot of things. Especially if you are building a mobile application with a microservice architecture, you have a lot of services and components floating around. So it becomes very easy on your part if you're using a managed tool, because the operations part of it, meaning setting up that tool or managing its availability and scalability, is taken off your head. Basically, you have hired someone to take care of it.
However, there is a risk with managed tools. Whether you're going with a cloud provider or a SaaS provider, these may or may not be private. I mean, the account may be private to you, but the infrastructure may be shared with other customers of that vendor, because these are SaaS products. It's something like, say, Google. The Google account is private to you. However, the infrastructure where Google hosts your data is shared with all of Google's customers. And if you have not sanitized your monitoring data, you may risk exposing personal information, if it is there in your monitoring data. That might put you at risk of compliance or regulatory violations. So if you are subject to compliance requirements or regulations, you may want to check with the vendor where they are hosting that monitoring data, because some countries have very strict policies under which you cannot keep personal information outside the country.
So you may have to confirm with the provider where the data is hosted, because that monitoring data will have sensitive information. The other risk that you might face is cost, because most often these managed services are charged based on utilization. Sometimes it's based on the number of systems that you are monitoring, and sometimes it is based on the size of the data you are hosting. So you need to be careful about how you are configuring these managed services. Otherwise it might end up costing you a hefty sum, which might eat up your whole infrastructure budget. So these are some of the pros and cons of managed services.
Now, what are the pros and cons of a self-managed service? Of course, the first benefit of a self-managed service is that it's completely private to your infrastructure. You set it up yourself, you install it, and it's completely up to you. So you are not at risk of violating any security policies, and you don't have to bother about that. And most often, when we talk about self-managed services, these are open source software, so they are free of cost. However, the downside is that you will have to take care of scaling it out when needed, because even the monitoring tool will sometimes run out of resources. If your infrastructure keeps growing or scaling up, the monitoring data that you're hosting will also grow. So you will have to take care of the uptime, downtime, upgrades and configuration of those self-managed systems.
So it's really up to you to choose which way you go. For some people, even for banks, going with a managed service is absolutely fine. But what they do is sign up and take a commitment from the vendors that their data will not be compromised in any way.
And they may negotiate the pricing too, to get it well under their budget. In my experience, going with a managed service for something like monitoring is quite helpful if you have a very big setup, if your setup is scaling, and if you anticipate that it will grow. In that case I would recommend you go with managed services. However, if you have a fairly manageable setup, and you do not anticipate it scaling overnight or very soon, you can start with a self-managed service. And sometimes these managed services also have a self-managed, open source version. So maybe you can start with the open source version and then slowly transition to the managed version. It's completely up to you. So these are some of the pros and cons of managed and self-managed services, and this is the first thing that you will have to decide on when you are choosing your monitoring tool.
Now, once you have chosen the monitoring tool, the second thing would be securing it. As I mentioned, no matter whether you are taking a managed service or a self-managed service, you will still have to protect your monitoring tool so that it is not compromised, because it will hold a lot of verbose-level data, right from usernames and user accounts to transaction details. You cannot just set up a managed monitoring system and open it up to the internet, so that the whole internet knows what is happening in your infrastructure or in your application. You need to protect it and make sure that it is not exposed to unauthorized people. So security is very important for monitoring tools. The first thing in security is identity management, and it helps a lot if you have identity management which is integrated with an identity provider.
So if you have an in-house identity provider, let's say, or G Suite, or Active Directory, or LDAP, you can integrate your monitoring tools with it. That way your users do not have to remember one more credential to log into the monitoring tool, it is also secure, and you can track who is logging in and who has access. The second thing in identity management is to have an identity provider which is enabled with single sign-on and role-based access control. That way, users can use one email ID or one user ID to log into the monitoring tool. And then you can also configure roles, basically who can view the monitoring data and who can configure it. Because if you open up access to everybody, people might accidentally go ahead and turn on debug logging, and it will just spam everybody. So you need to make sure that your monitoring tool is protected with identity management through an identity provider's single sign-on and role-based access control.
The second thing that is needed for securing it is data sanity and data privacy, especially if you're using a managed tool or managed service. You will have to make sure that the monitoring data is hosted within the infrastructure that you have purchased and not shared across the world. So make sure that you confirm it with the vendor from whom you have taken the managed service. And of course, if you're going with a self-managed service, these things are always taken care of. As for data sanity, you will have to make sure that the data does not contain any unnecessary sensitive information. Try to sanitize it as much as possible. Sometimes it may be difficult, but it is always good to sanitize the data so that you are not at risk of any audit failures.
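Sanitizing monitoring data before it leaves your systems can be sketched like this. It is a minimal illustration, assuming that email addresses and long digit runs (such as account or card numbers) are the sensitive values; the patterns and placeholder strings are hypothetical, and a real setup would need rules that match its own data and regulations.

```python
import re

# Assumed patterns for this sketch only; real rules depend on your data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_DIGITS_RE = re.compile(r"\b\d{10,19}\b")  # e.g. account or card numbers


def sanitize(line: str) -> str:
    """Replace likely personal data in a log line with placeholders."""
    line = EMAIL_RE.sub("<email>", line)
    line = LONG_DIGITS_RE.sub("<number>", line)
    return line


if __name__ == "__main__":
    raw = "login failed for jane@example.com card 4111111111111111 status=500"
    print(sanitize(raw))
```

Short numbers such as status codes are left alone, so the sanitized line stays useful for debugging.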
So the third thing that is very important while choosing a monitoring tool is operations, meaning how to operate it. It will keep changing: you may need to keep adding more monitoring configurations as your setup grows and as you scale out. That's what I mentioned when I talked about managed and self-managed services: operations is very important, and technical support as well. So you will have to make sure that those things are taken care of. You cannot just have a monitoring tool that nobody is looking at, and then suddenly you need it and you see that it is not working.
Then there is compliance of user data and localized caching. Sometimes these managed services cache the data across countries. So if you have a compliance requirement or regulation where the data has to be localized within the country, you need to confirm with the vendor that they are not caching it in some other location.
Another thing is cloud service monitoring. If you're using a monitoring tool which is provided by the same cloud provider where you're hosting your application, then you're sorted. But if you're using something from outside the cloud provider, you will have to see whether it integrates with all the cloud services or not, because sometimes it may not include all of them. Then you need to find a way to integrate it, maybe by writing some libraries or plugins, because all your infrastructure will be managed in the cloud, and you need to monitor the cloud service itself, whether it is up and running or not. So these are some of the things that should be kept in mind when you are choosing your monitoring tool.
And once you have chosen it, it is very important that you configure alerts, because just setting up monitoring is not enough. You should set up certain alerts and notifications for some critical events.
Now, what are those? If you're on cloud, you will have to set up alerts for cloud account billing, root login, and renewal of subscription. Billing: if you have exceeded your billing margin, or if you have not paid for something. Root login: if somebody has logged into your root account, you need to know about it, whether it is an authorized login or an unauthorized one. Then renewal of subscription: if you have a subscription which is going to renew, you need to be notified, because otherwise the subscription will expire. And that might even delete your systems. Some cloud providers actually do that: once the subscription expires, they delete the whole system, so you might lose data. So you'll need to configure alerts for these activities on the cloud.
Then, as we said when talking about what to monitor, for things like modification of configurations you need to be alerted. Then, on service availability, whatever you are monitoring, there has to be an alert. A warning, a failure or an error, anything that happens, has to notify a group of people, whether you have a Slack channel or you prefer SMS. Those things have to be configured, because without alerts the whole monitoring setup is useless. It is also better that the alerts are configured in a sanitized way, so that they do not spam people. Because if you start spamming people, they will lose interest, and that will force them to create filters, and then they will again skip the important information. And then it is important that these alerts are acknowledged. If you have seen something and you are going to work on it, acknowledge it first and then start, because then people will know that this alert is acknowledged and somebody is working on it.
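The two alerting habits above, not spamming people and acknowledging alerts, can be sketched as a small in-memory board. This is only an illustration: the `AlertBoard` class is hypothetical, and the `sent` list stands in for a real Slack or SMS delivery, which is not implemented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Alert:
    key: str                      # identifies the underlying problem
    message: str
    count: int = 1                # how many times the same problem fired
    acknowledged_by: Optional[str] = None


@dataclass
class AlertBoard:
    open_alerts: Dict[str, Alert] = field(default_factory=dict)
    sent: List[str] = field(default_factory=list)  # stand-in for Slack/SMS

    def raise_alert(self, key: str, message: str) -> bool:
        """Notify only the first time a problem fires; count the repeats."""
        existing = self.open_alerts.get(key)
        if existing is not None:
            existing.count += 1
            return False  # duplicate, nobody gets spammed
        self.open_alerts[key] = Alert(key, message)
        self.sent.append(message)
        return True

    def acknowledge(self, key: str, user: str) -> None:
        """Record who is working on the alert (raises KeyError if unknown)."""
        self.open_alerts[key].acknowledged_by = user


if __name__ == "__main__":
    board = AlertBoard()
    board.raise_alert("root-login", "Root account login detected")
    board.raise_alert("root-login", "Root account login detected")
    board.acknowledge("root-login", "ops-level-1")
    print(board.open_alerts["root-login"])
    print("notifications sent:", len(board.sent))
```

Repeats are counted rather than discarded, so the dashboard can still show how often the problem fired.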
So configuring the alerts along with the monitoring is very important. Otherwise, the whole purpose of gathering all this data in monitoring is lost.
Now, finally, some things to consider when you are configuring your monitoring dashboard. Not everybody is technical at the level of a developer or a system administrator. Most often the first responders are the operations team, and they need a better dashboard to look at issues. A developer might immediately tell you what the issue is, but the operations team, the people who are first responders on level one teams, may not all be aware of the exact issue. So it is very important that the dashboards are clear and crisp and show you the exact data, and are not just a messy thing going all over the place. This is another downside of open source software: the dashboarding is not that clear or clean, and you need to put a lot of effort into building a dashboard that your operations team understands. And you need to make sure that you train the operations team to use that tool, so that they can debug faster.
It's okay to go with a licensed monitoring tool, provided it covers all the layers of monitoring, like infra, app, mobile, business, etc. Do not go with multiple licensed tools, because then you will be wasting money. So look for a tool which gives you all the levels of monitoring and not just infra or app or business, and choose a tool that way.
Monitoring screens also help in case you have satellite service centers, or during critical launches: you can put up big screens in your offices. I know the situation has changed and we are not really working from offices now, but I think we are going back to normal again.
So it is important that we have these displays there, especially if you're going for critical launches, so that you have a dashboard which shows you what is happening when you actually go for the big launch, right? So these are some of the things that you have to consider for your operations and finance while you're selecting and configuring your monitoring system. So, to summarize: for a fault-tolerant infrastructure, an infrastructure which sustains itself and proactively tells you what is happening if there is an issue, you need robust end-to-end monitoring and alerting. And this can only be achieved if you have a centralized system to monitor the state, performance and events happening across the infrastructure and application, and which also sends you alerts almost immediately when incidents happen, right? This is another important thing: when you have multiple systems, the alerting will take up time, because going from one system to another system adds a lag. Even if each one is taking a lag of one minute, you will get the alert after two or three minutes. So that is all the more reason to choose a centralized tool which also has an alerting system. And this monitoring system should be integrated with a centralized identity provider that has role-based access control and single sign-on, so that you have secure ease of access for all your users, internal users, external users, whoever it is. So this is how monitoring should be taken care of, so that you can achieve that kind of observability and confidence in your setup, and you can give that kind of confidence to all levels, not just your engineers. And your engineers will find it easier, because your executives or your bosses will also know that, yes, if I have to just quickly browse what is happening in the setup, I need to just go to this monitoring system and I'll see that everything is green, right?
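The point about lag adding up across multiple systems is simple arithmetic, and a tiny Python sketch makes it concrete. The function name and the one-minute-per-system figure are assumptions taken from the example above, not measurements from any real tool.

```python
from typing import Sequence


def end_to_end_alert_delay(lags_in_seconds: Sequence[float]) -> float:
    """Total delay before an alert arrives, given the lag added by each system in the chain."""
    return sum(lags_in_seconds)


# One centralized tool that both monitors and alerts: a single one-minute lag.
centralized = end_to_end_alert_delay([60])

# Three separate systems passing the event along, one minute each.
chained = end_to_end_alert_delay([60, 60, 60])

print(f"centralized: {centralized / 60:.0f} min, chained: {chained / 60:.0f} min")
```

Every extra system in the path adds its own lag, which is why a single tool that monitors and alerts shortens the time to notification.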
So in short, centralized monitoring enabled with proper alerting and an identity system will help you create a fault-tolerant infrastructure, reduce downtime and, you know, resolve issues faster.
I hope I have covered most of the aspects of monitoring and that this session was useful for you. Please do subscribe to this podcast on Spotify, Google Podcasts, Apple Podcasts or Stitcher. This is cloudkata, The Modern Infrastructure Show. I'll be back with another new episode. Till then, stay safe. Stay happy. Thank you, everyone. Bye bye.
Transcribed by https://otter.ai
Other episodes
K013: A Bank On Cloud – Part 2 : The People Side Of The Story
Distance by pandemic united by goal – How I enabled a completely remote in-house DevOps community from scratch for a bank, in the midst of a pandemic when the entire team was locked down across multiple cities and timezones.
K012: A Bank On Cloud – Part 1 : The Tech Side Of The Story
Designing & developing modern infrastructure for one of Indonesia’s first cloud-native Digital Banks. In this two part story, this episode covers the tech side of the project.