Optimize your Azure Monitor Logs usage and reduce costs

Stanislav Zhelyazkov

Azure Monitor Logs, also known as “Log Analytics”, is a component of Azure Monitor where you can store logs from various sources, both Azure and non-Azure. Retaining logs can become quite expensive. In this article, we’ll discuss what to log, how to discover which logs are large, and how to reduce log volume.

Azure Monitor logging

Logs can differ depending on their purpose, usage, source, etc. Here are a few types of logs:

  • Infrastructure logs;
  • Security logs;
  • Application logs;
  • Troubleshooting logs;
  • Metrics or Performance counters;
  • Network logs;
  • Audit logs;
  • etc.

Logs in Azure Monitor are stored in different tables. Microsoft’s current approach is storing each log that differs in structure and source in a separate table.

There is some legacy, however, where different logs were stored in the same table. An example is AzureDiagnostics, where logs from different sources are stored. This approach resulted in a cluttered table, so more and more Azure services are moving to storing logs in their own tables instead of in AzureDiagnostics.

What data to log

Logs can be categorized as:

  • Frequency-based logging - A log record is sent/ingested at a fixed interval, for example every one or five minutes. Such logs are mostly metrics and performance counters.
  • Event-based logging - A log record is sent/ingested only when an actual event occurs. Keep in mind that some metrics can be logged this way as well. For example, they generate a record only when the value of the metric is different from 0.

The Log Analytics workspace is a time-series database. In other words: log records are stored by the time and date they were uploaded into Log Analytics. In its default state, the records are purged once they’re older than 31 days. The Azure Monitor Logs pricing model is based on the size of ingested data. While you’re billed for the amount of data ingested, there are no retention costs during that first 31-day period.
It is possible to extend the default retention period for an additional fee per stored GB. This is a common scenario for many organizations. Reasons can be:

  • Compliance - To meet certain compliance requirements for your company, certain logs need to be retained for 1, 2, 3, 5, 7, or more years.
  • Analysis - Your company’s processes may require you to compare and analyze data over a longer period, with critical decisions taken based on the results of those analyses. This includes using machine learning to analyze the data.
  • Problem-solving - Oftentimes, it takes a while to notice issues and pinpoint their root cause. In some cases, this requires access to logs for a longer period than just a month.

These are some of the basics of working with Azure Monitor Logs. As you can see, there are a lot of important things to consider when ingesting and storing logs. Poor planning can easily lead to an increase in the logging data volume - and excessive billing. Common cost-drivers are:

  • Ingesting and storing logs that you do not need;
  • Retaining logs longer than required;
  • Storing logs on the incorrect tier (I’ll get back to this in detail later on);
  • Not accounting for the growth in data volume as new resources are created.

Not taking these considerations into account could result in a quick increase of the Log Analytics workspace data volume, and thus cost. Next, we will look at how to find the size of data, what Log Analytics features we can use to reduce data volume and cost and some tips and tricks on when to use these features.

Analyze Log Analytics usage

To find out the data usage of these logs, Log Analytics provides a system table named Usage. The Usage table contains information about the size of logs for each data table. We can run the query below to get the size of the logs in GBs per table and solution for the past 31 days.

Usage
| where TimeGenerated > ago(32d)
| where StartTime >= now(-31d) and EndTime < now()
| where IsBillable == true
| summarize BillableDataGB = round(sum(Quantity) / 1000.,2) by DataType, Solution
| sort by BillableDataGB desc

Usage by table solution

A few things to notice:

  • The usage data time frame is filtered by StartTime and EndTime, not TimeGenerated
  • We’ve filtered for billable data only (e.g. Azure Activity Log is not billed for 90 days)
  • We summarize the size by DataType (i.e. name of tables) and Solution
  • The result is sorted by BillableDataGB in descending order

The screenshot shows that the majority of usage stems from the table AzureDiagnostics. This table contains data from multiple sources and resources. To discover where the data comes from, we can drill down further by using system columns like _BilledSize and _IsBillable. These columns are present in every log record in every table.
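
Because _BilledSize and _IsBillable exist on every record, the same drill-down can be done across tables at once. The query below is a minimal sketch of that idea, grouping billable data by the standard _ResourceId column. It assumes a one-day window is enough to get a representative picture; scanning all tables with find is expensive, so widen the window with care.

```kusto
// Sketch: billable data per Azure resource across all tables, last 24 hours.
// The one-day window is an assumption chosen to keep the scan cheap.
find where TimeGenerated > ago(1d) project _ResourceId, _BilledSize, _IsBillable
| where _IsBillable == true
| summarize BillableDataMB = round(sum(_BilledSize) / (1024 * 1024), 2) by _ResourceId
| sort by BillableDataMB desc
```

Resources near the top of the result are the first candidates for a closer look at their diagnostic settings.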

For the table AzureDiagnostics, three columns provide information about where the data is coming from: ResourceProvider, ResourceType and Category. Using the query below, we can find the table’s log size per resource provider, resource type and category for the past 31 days:

AzureDiagnostics
| where _IsBillable == true
| where TimeGenerated > now(-31d)
| summarize BillableDataGB = round(sum(_BilledSize)/(1024*1024*1024), 2) by ResourceProvider, ResourceType, Category
| sort by BillableDataGB nulls last

Azure Diagnostics table breakdown

The screenshot shows that the largest log sizes stem from Azure Firewall resources, specifically from AzureFirewallNetworkRule. If needed we can even break down further to see how log size is distributed per Azure Firewall, Source IP, Target IP and action with the query below.

AzureDiagnostics
| where _IsBillable == true
| where TimeGenerated > now(-7d)
| where ResourceType == 'AZUREFIREWALLS' and Category == 'AzureFirewallNetworkRule'
| parse msg_s with Protocol " request from " SourceIP ":" SourcePortInt: int " to " TargetIP ":" TargetPortInt: int * 
| parse msg_s with * ". Action: " Action1a
| parse msg_s with * "was " Action1b " to " NatDestination
| parse msg_s with Protocol2 " request from " SourceIP2 " to " TargetIP2 ". Action:" Action2 
| extend SourcePort = tostring(SourcePortInt), TargetPort = tostring(TargetPortInt) 
| extend Action = case(Action1a == "", case(Action1b == "", Action2, Action1b), Action1a), Protocol = case(Protocol == "", Protocol2, Protocol), SourceIP = case(SourceIP == "", SourceIP2, SourceIP), TargetIP = case(TargetIP == "", TargetIP2, TargetIP), SourcePort = case(SourcePort == "", "N/A", SourcePort), TargetPort = case(TargetPort == "", "N/A", TargetPort), NatDestination = case(NatDestination == "", "N/A", NatDestination) 
| summarize BillableDataGB = round(sum(_BilledSize)/(1024*1024*1024), 2) by _ResourceId, SourceIP, TargetIP, Action
| sort by BillableDataGB nulls last

Note: The execution time of a query is heavily affected by complexity and data scope. To maintain speed, I have limited the query to the last 7 days.

You can find more information on how to analyze Log Analytics usage data.
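
Totals over 31 days hide sudden jumps. As a complement to the queries above, the sketch below breaks the Usage data down per day and per table, so that a new resource or a changed diagnostic setting shows up as a step in the trend. It assumes, as in the first query, that Quantity is reported in MB.

```kusto
// Sketch: daily billable volume per table for the last 31 full days.
Usage
| where TimeGenerated > ago(32d)
| where StartTime >= startofday(ago(31d)) and EndTime < startofday(now())
| where IsBillable == true
| summarize BillableDataGB = round(sum(Quantity) / 1000., 2) by bin(StartTime, 1d), DataType
| sort by StartTime asc, BillableDataGB desc
```

In the portal, the result can be switched to a chart view to make the trend easier to read.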

Optimize Log Analytics data usage

Now that you’ve learned how to analyze your Log Analytics usage, it’s time to optimize the workspace. Here are several ways to reduce your data usage:

  • Use Azure Monitor Agent and Data Collection rules over Log Analytics agent - Azure Monitor Agent provides the granularity of ingesting specific events and performance metrics per machine or per group of machines. The Log Analytics agent, by contrast, takes its configuration at workspace level, so the same data collection settings apply to all onboarded machines. With Azure Monitor Agent, you can control exactly which data is logged, and how. A few tips:
    • Assign lower collection frequencies to less critical machines. Production machines might require gathering performance metrics every minute. However, for Development and Test machines a five-minute frequency can do.
    • Assign lower collection frequencies to metrics with low fluctuation. Disk size might not vary much, so collecting it less often across hundreds of machines with various drives and volumes can save quite some costs.
    • Instead of writing all events from a System or Application log to the storage, gather only the events you need to monitor.
    • For example, when you use Sentinel you could gather one set of security events from one group of machines and another set from another group.
  • Fine tune collected data - It is quite common to see a lot of data ingested by the Container Insights solution, but some of that data can be reduced with a little post-onboarding configuration. For example, if you also switch to ContainerLogV2, you could reduce the size by up to 10%.
  • Set retention per table and leave the workspace retention at its default. That way, if you want a table to have a specific retention, you have to increase it explicitly and thus acknowledge that additional cost will occur. New tables will also automatically get the workspace’s default retention. A few examples:
    • Tables like Perf, InsightsMetrics or AzureMetrics contain performance or metrics data. If you do not have a particular compliance requirement, you probably do not need the data beyond 31 days or 3 months. Also keep in mind that the AzureMetrics data is available free-of-cost for 90 days on the Azure Monitor Metrics platform.
    • Tables like VMConnection, VMBoundPort, VMProcess, etc. do not need higher retention.
    • Look at every table and set the retention to your needs, no matter how big or small the cost saving will be.
  • If you have logs (tables) that are purely for troubleshooting purposes and you do not need to alert on them, you can set the Basic tier for these logs. A perfect example is the table ContainerLogV2.
  • Set archival tier per table - To meet certain compliance rules, you may need some of the data available for a longer period of time. In most cases you do not work with any of the data older than 3 months. In this case, you can move any table to the archival tier after a certain time period. Just like Basic, storing data on the archival tier costs less. With the archival tier you can always pull the data back if you need it for some purpose, for example an audit. As security or audit data can be quite large in volume, those are perfect candidates for the archival tier.
  • If you reach a certain size of data ingested per day, set a commitment tier SKU for the workspace. Do this after you have done all other cost optimizations in order to make sure that the option is still applicable for your environment.
  • Configure diagnostic settings with Storage account - For Azure resources that support diagnostic settings, you can send logs and metrics to a Storage account. If you do not want to use the archival tier on tables, or if the table contains data from multiple sources like the table AzureDiagnostics, you can send certain logs to a Storage account directly. For example, Azure SQL Database audit data goes into the AzureDiagnostics table. You may want to retain SQL audit data for a longer period of time, but not all the data in the table AzureDiagnostics. In that case, set the desired retention for the table AzureDiagnostics and configure the Azure SQL Database resources to send audit data to a Storage account, where it is retained for a longer period of time.
  • Configure diagnostic settings with only the logs that are needed and used. Many services have more than one log that can be sent. In many cases you may not need some of these logs at all. An example is logs from Azure VPN Gateway. In most cases you will not need IKEDiagnosticLog and P2SDiagnosticLog in your Log Analytics workspace. The final choice depends on the answers to these questions:
    • Do I need the log to alert upon?
    • Do I need the log for compliance reasons?
    • Do I need the log for troubleshooting purposes? If yes, do I need the log all the time or can I ingest only for the time when there is an issue I need to analyze to resolve the problem?
  • Use ingestion-time transformations - This is a new feature that allows you to transform data as it arrives at the Log Analytics workspace. Currently it is in preview and only certain tables are supported. The table AzureDiagnostics is not supported, but in the future we could take the above example and block data from firewall network rules that is tied to a specific source IP. If that source IP generates a lot of data, and we have verified that it is trusted, we can filter out that data so it is never ingested into Log Analytics, and thus reduce cost.
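
To give an impression of what an ingestion-time transformation looks like: it is a KQL statement that runs against a virtual table named source, and only the rows it returns are ingested. The sketch below is an illustration on the Syslog table rather than on AzureDiagnostics. It assumes that Syslog is collected through a data collection rule that accepts a transformation and that the incoming records carry the SeverityLevel column; verify both for your environment before relying on it.

```kusto
// Sketch of a transformation query (assumption: applied to Syslog in a data collection rule).
// Rows that do not pass the filter are dropped before ingestion and are not billed as log data.
source
| where SeverityLevel != "debug"
```

The same pattern, a where clause on source, is what a future source IP filter on firewall logs would look like once the table is supported.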

It is important to note that if you have Microsoft Sentinel enabled on the Log Analytics workspace, you can increase the workspace retention to 90 days, rather than the 31-day default. In that case, remember to keep the SKUs of both services in sync.
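
Whether a commitment tier makes sense depends on how much billable data the workspace ingests per day. The sketch below derives the average and peak daily volume from the Usage table, again assuming Quantity is reported in MB. Compare the outcome with the commitment tier sizes on the current Azure Monitor pricing page, ideally after the other optimizations have been applied.

```kusto
// Sketch: average and peak daily billable ingestion over the last 31 full days.
Usage
| where TimeGenerated > ago(32d)
| where StartTime >= startofday(ago(31d)) and EndTime < startofday(now())
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1000. by bin(StartTime, 1d)
| summarize AvgDailyGB = round(avg(DailyGB), 2), MaxDailyGB = round(max(DailyGB), 2)
```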

Find more best practices on cost optimization for Azure Monitor.

Can we help you reduce your costs?

Within our Sentia Azure Landing Zone we have developed an Azure Monitor workbook that interactively breaks down the cost of logs and hunts down the sources of the logs. The workbook provides options like calculating the price of the logs for ingestion and retention when not using the archival feature. With this workbook we are able to pinpoint parts of the setup where we can use the features mentioned in this article to reduce cost significantly. Get in touch if you’d like us to optimize your cost as well!