Prometheus high availability and fault tolerance strategy, long term storage with VictoriaMetrics

The "Why" of this article
Prometheus is a great tool for monitoring small, medium, and big infrastructures.
Prometheus, however, and the development team behind it, are focused on scraping metrics. It is a particularly good solution for short term retention of metrics. Long term retention is another story, unless it is used for collecting only a small number of metrics. This is normal in a way, because most of the time, when investigating a problem with the metrics scraped by Prometheus, we use metrics that are not older than 10 days. But this is not always the case, especially when the statistics we are looking for are correlations between different periods, like different weeks across months, or different months, or when we are interested in keeping a historical synthesis.
Actually, Prometheus is perfectly able to collect metrics and to store them even for a long time, but storage becomes extremely expensive since Prometheus needs fast storage, and Prometheus is not known as a solution that provides HA and FT in a sophisticated way (as we are going to explain, there is a way, not very sophisticated, but it is there). In this article we will explain how to achieve HA and FT for Prometheus, and also how to achieve long term storage for metrics in a better way, using another tool.
That said, during the past years many tools started to compete, and many are still competing, to solve these problems and more.
The common components of a Prometheus installation are:

  • Prometheus
  • Blackbox
  • Exporters
  • AlertManager
  • PushGateway
HA and FT of Prometheus
Prometheus can use federation (Hierarchical and Cross-Service), which allows a Prometheus instance to be configured to scrape selected metrics from other Prometheus instances (https://prometheus.io/docs/prometheus/latest/federation/). This kind of solution is pretty good when you want to expose only a subset of selected metrics to tools like Grafana, or when you want to aggregate cross-functional metrics (like business metrics from one Prometheus and a subset of service metrics from another one working in a federated way). This is perfectly fine, and it can work in many use cases, but it is not compliant with the concept of High Availability, nor with the concept of Fault Tolerance: we are still talking about a subset of metrics, and if one of the Prometheus instances goes down, those metrics will not be collected during the downtime. Making Prometheus HA and FT must be done differently: there is no native solution from the Prometheus project itself.
Prometheus can achieve HA and FT in a very easy way, without the need for complex clusters or consensus strategies.
What we have to do is duplicate the same configuration file, prometheus.yml, on two different instances configured in the same manner, which will scrape the same metrics from the same sources. The only difference is that instance A also monitors instance B, and vice versa. The good old concept of redundancy is easy to implement, it is solid, and if we use IaC (Infrastructure as Code, like Terraform) and a CM (Configuration Manager, like Ansible) it will also be extremely easy to manage and maintain. You do not want to duplicate an extremely big and expensive instance; it is better to duplicate a small instance and keep only short term metrics on it. This also makes the instances quick to recreate, as in the sketch below.
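As a minimal sketch (the hostnames prometheus-a/prometheus-b and the node exporter targets are placeholders, not taken from the original article), the two prometheus.yml files can be identical except for which peer they watch:

```yaml
# prometheus.yml on instance A (instance B uses the same file,
# with the peer job pointing back at prometheus-a:9090).
global:
  scrape_interval: 30s

scrape_configs:
  # Identical scrape jobs on both instances: same metrics, same sources.
  - job_name: node
    static_configs:
      - targets: ['10.0.0.11:9100', '10.0.0.12:9100']

  # Each instance also monitors its twin.
  - job_name: prometheus-peer
    static_configs:
      - targets: ['prometheus-b:9090']
```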
What about the other mentioned services?
Well, AlertManager has the concept of a cluster: it is capable of deduplicating alerts received from multiple Prometheus instances and of interacting with the other AlertManagers so that an alert is fired only once. So, let us install AlertManager on two different instances, perhaps the two that are already hosting Prometheus A and its copy, Prometheus B. Of course we also use our IaC and CM solution to keep the AlertManager configuration in code.
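A hedged sketch of the Prometheus side of this setup (hostnames are placeholders); the two AlertManagers themselves are joined into a cluster by starting each one with a --cluster.peer flag pointing at the other:

```yaml
# Fragment of prometheus.yml, identical on Prometheus A and B.
# Both AlertManagers receive every alert; the AlertManager cluster
# (each started with --cluster.peer=<other-alertmanager>:9094)
# deduplicates them so a notification goes out only once.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-a:9093'
            - 'alertmanager-b:9093'
```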
NodeExporters are installed directly on the nodes that are the source of the metrics you are collecting; there is no need to duplicate anything there. The Prometheus configuration is the same on both instances, so the only requirement is to allow both Prometheus A and Prometheus B to connect to them.
PushGateway is a little bit different: simply duplicating it is not enough, you have to create a single point of injection for the metrics that are pushed to it (while Prometheus normally works by pulling metrics). The way to make it HA and FT is to duplicate it on the two instances and put a DNS record in front of them, configured as an active/passive failover, so there will always be an active push gateway, and in case of failure the second one will be promoted to active. In this way, you can provide a unique entry point to batch processes, lambdas, sporadic jobs, and so on. You can also use a load balancer in front of them; personally I prefer an active/passive solution in this case, but it is up to you.
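On the scrape side, one possible sketch is to let both Prometheus instances scrape the gateway through that failover DNS name (the name below is a placeholder); honor_labels: true keeps the job and instance labels that were pushed instead of overwriting them:

```yaml
# Fragment of prometheus.yml, identical on Prometheus A and B.
scrape_configs:
  - job_name: pushgateway
    # Preserve the labels pushed by the batch jobs rather than replacing
    # them with the gateway's own job/instance labels.
    honor_labels: true
    static_configs:
      - targets: ['pushgateway.example.internal:9091']
```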
BlackBox is another tool with no concept of HA and FT, but we can duplicate it as well, on the same two instances, A and B, that we have already configured.
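For completeness, a typical probe job for the blackbox exporter looks like the sketch below, with each Prometheus instance pointing at its local exporter (the module name and probed URL are placeholders):

```yaml
# Fragment of prometheus.yml on instance A (instance B scrapes its own
# local blackbox exporter in the same way).
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]            # module defined in blackbox.yml
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      # Pass the original target as the ?target= query parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label for readability...
      - source_labels: [__param_target]
        target_label: instance
      # ...and actually scrape the local blackbox exporter.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```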
Now we have two small instances of Prometheus, with two AlertManagers working together as a cluster, two PushGateways in an active/passive configuration, and two BlackBoxes, so HA and FT are achieved.
There is no reason to use these instances to collect all the metrics in your farm, which might be composed of different VPCs that reside in different regions, belong to different accounts, or even be hosted by different cloud providers, and, if you are lucky, your farm also includes something on-premises. There is no reason to do so because the small instances would become extremely big in that case, and when something small fails it is normally easier to fix. It is common practice to have many Prometheus groups in the HA and FT configuration described above, each responsible for a specific part of the infrastructure. The definition of "part" is really up to you: it depends on your needs, requirements, network and security configuration, trust between your teams, and so on.
So, as a recap, we have small or relatively small instances of Prometheus, duplicated together with all the services mentioned above, we have the code to recreate them quickly, and we can tolerate the complete failure of one instance per group. This is definitely a step in the right direction if our HA and FT plan used to be called "hope".
[Image: Prometheus]
VictoriaMetrics
We have Prometheus and its ecosystem configured for HA and FT, and we have multiple groups of Prometheus instances, each focused on its own part of the infrastructure and relatively small.
Cool, but we are keeping the data for only, let's say, 10 days. That is probably the most important period to query, but of course it is not enough. What about long term storage for metrics?
Here come solutions like Cortex, Thanos, M3DB, VictoriaMetrics, and others. They can collect the metrics from different Prometheus instances, deduplicate the duplicated metrics (you will have a lot of them: remember, every Prometheus instance you have is duplicated, so every metric is scraped twice), and they can provide a single point of storage for all the metrics you are collecting.
Even if Cortex, Thanos, and M3DB are great tools, definitely capable of achieving the goal of long term storage for metrics, and of being HA and FT themselves, we chose the newcomer, VictoriaMetrics. This article will not focus on comparing all of these tools, but I am going to describe why we chose VictoriaMetrics.
VictoriaMetrics is available in two different configurations. One is an all-in-one solution, easier to configure, with all the components together (it is a good and stable solution, also capable of scaling, but only vertically, so it can be the right choice depending on your needs). The other is the cluster solution, with separated components, so you can scale every single component both vertically and horizontally.
We like complex things (that's definitely not true), so we decided to use the cluster solution.
The cluster version of VictoriaMetrics is composed of three main components: the "vmstorage" (responsible for storing the data), the "vminsert" (responsible for writing the data into the storage), and the "vmselect" (responsible for querying the data from the storage). The tool is very flexible, and vminsert and vmselect act as a sort of proxy.
Vminsert, as said, is responsible for inserting the data into the vmstorage. There are many options you can configure, but for the scope of this article it is important to know that you can easily duplicate vminsert across an arbitrary number of instances and put a Load Balancer in front of them as a single point of injection for incoming data. Vminsert is stateless, so it is also easy to manage and duplicate, and it is a good candidate for immutable infrastructure and autoscaling groups. The component accepts some options that you should provide; the most important are the storage addresses (you have to provide the list of the storages) and "-replicationFactor=N", where N is the number of storage nodes the data will be replicated to. So, who will send the data to the balancer in front of the vminsert nodes? The answer is Prometheus, using the "remote_write" configuration (https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write), with the Load Balancer of vminsert as the target.
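A hedged example of that remote_write block: the load balancer hostname is a placeholder, and the path assumes the default cluster write endpoint /insert/<accountID>/prometheus with tenant 0 and the default vminsert port 8480 (check the docs of your VictoriaMetrics version for the exact endpoint):

```yaml
# Fragment of prometheus.yml, identical on Prometheus A and B.
remote_write:
  # Single entry point: the Load Balancer in front of the vminsert nodes.
  - url: http://vminsert-lb.example.internal:8480/insert/0/prometheus/
    queue_config:
      max_samples_per_send: 10000   # tune to your ingestion volume
```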
Vmstorage is the core component and the most critical one. Contrary to vminsert and vmselect, vmstorage is stateful, and each instance does not really know about the other instances in the pool. From its own perspective, every vmstorage is an isolated component; it is optimized to use high-latency IO and low IOPS storage from cloud providers, which makes it definitely less expensive than the storage used by Prometheus. Crucial options are:
  • "-storageDataPath": the path where the metrics are saved on disk,
  • "-retentionPeriod": as in Prometheus, the period of time the metrics are retained,
  • "-dedup.minScrapeInterval": which deduplicates the received metrics in the background.
Every vmstorage has its own data, but the "-replicationFactor" option of vminsert means the data is sent to, and therefore replicated on, N storage nodes. The component can be scaled vertically if needed, and bigger storage can be used, but because of the type of this storage (high-latency IO and low IOPS) it will not be expensive even for long term retention.
Vmselect is responsible for querying the data from the storages. Like vminsert, it can easily be duplicated across an arbitrary number of instances and can be configured with a Load Balancer in front of them, creating a single entry point for querying metrics. You can scale it horizontally, and it also offers many options. The Load Balancer, as said, will be the single entry point for querying the data, which is now collected from multiple Prometheus groups, with a retention that can be arbitrarily long, depending on your needs. The main consumer of all this data will probably be Grafana. Similarly to vminsert, vmselect can be configured in an autoscaling group. A sketch of how the three components fit together follows.
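Purely as an illustration (not the article's actual deployment), here is a minimal compose-style sketch of the three components. Image names, flags, and ports are assumptions based on the VictoriaMetrics cluster defaults (vmstorage accepts inserts on 8400 and selects on 8401; vminsert and vmselect expose HTTP on 8480 and 8481):

```yaml
# Single-host sketch only; in production each component runs on its own
# instances behind load balancers, as described above.
services:
  vmstorage-1:
    image: victoriametrics/vmstorage
    command:
      - "-storageDataPath=/storage"
      - "-retentionPeriod=12"              # in months
      - "-dedup.minScrapeInterval=30s"     # deduplicate the HA pair's samples
  vmstorage-2:
    image: victoriametrics/vmstorage
    command:
      - "-storageDataPath=/storage"
      - "-retentionPeriod=12"
      - "-dedup.minScrapeInterval=30s"
  vminsert:
    image: victoriametrics/vminsert
    command:
      - "-storageNode=vmstorage-1:8400"    # vmstorage accepts inserts on 8400
      - "-storageNode=vmstorage-2:8400"
      - "-replicationFactor=2"             # each sample lands on 2 storage nodes
  vmselect:
    image: victoriametrics/vmselect
    command:
      - "-storageNode=vmstorage-1:8401"    # vmstorage serves selects on 8401
      - "-storageNode=vmstorage-2:8401"
```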
[Image: VictoriaMetrics]
What About Grafana?
Grafana is a great tool for interacting with and querying the metrics from Prometheus, and it can do the same with VictoriaMetrics through the Load Balancer in front of the vmselect instances. This is possible because VictoriaMetrics is compatible with PromQL (the query language of Prometheus), even though VictoriaMetrics also has its own query language (called MetricsQL). Now we have all of our components in HA and FT, so let's also make Grafana HA and FT capable.
In many installations, Grafana uses SQLite as the default solution for keeping its state. The problem is that SQLite is a great database for development purposes, mobile applications, and many other scopes, but not really for achieving HA and FT. For this scope it is better to use a standard database; as an example we can use an RDS PostgreSQL with Multi-AZ capabilities (which will be responsible for the state of the application), and this solves our main problem.
For the Grafana application itself, and in order to provide users with a single entry point to interact with it, we can create an arbitrary number of identical Grafana instances, all configured to connect to the same RDS PostgreSQL. How many Grafana instances to create is up to your needs; you can scale them horizontally and also vertically. PostgreSQL can also be installed on your own instances, but I am lazy and I like to use services from cloud providers when they do a great job and do not lock you in to the vendor. This is a perfect example of something that can make our lives easier.
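One way to point every Grafana instance at that shared database is through Grafana's GF_DATABASE_* environment variables; the values below are placeholders, and the same settings could equally live in the [database] section of grafana.ini:

```yaml
# Environment for every Grafana instance (compose, autoscaling-group user
# data, etc.); all instances share the same PostgreSQL, so they share
# dashboards, users, and other state.
services:
  grafana:
    image: grafana/grafana
    environment:
      GF_DATABASE_TYPE: postgres
      GF_DATABASE_HOST: grafana-db.example.internal:5432
      GF_DATABASE_NAME: grafana
      GF_DATABASE_USER: grafana
      GF_DATABASE_PASSWORD: change-me
```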
Now we need a Load Balancer that will be responsible for balancing the traffic between the N Grafana instances and our users. We can also map the unfriendly Load Balancer address to a friendly DNS name.
Grafana can be connected to the VictoriaMetrics vmselect Load Balancer using a datasource of type Prometheus, and this closes our observability infrastructure. Our infrastructure is now HA and FT in all of its components, configured to be resilient, scope focused, long term storage capable, and cost optimized. We can also add an automated process that creates scheduled snapshots of the vmstorages and sends them to an S3-compatible bucket, to make the retention period even longer.
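A possible provisioning file for that datasource; the URL is a placeholder and assumes the default cluster read endpoint /select/<accountID>/prometheus with tenant 0 and the default vmselect port 8481:

```yaml
# e.g. /etc/grafana/provisioning/datasources/victoriametrics.yml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus          # PromQL-compatible, so the Prometheus type works
    access: proxy
    url: http://vmselect-lb.example.internal:8481/select/0/prometheus
    isDefault: true
```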
Well, this was the metrics part. We are still missing the logging part, but that is another story. :)
[Image: Grafana]
The complete architecture:
[Image: Complete architecture]
Links to products
  • Prometheus: https://prometheus.io/
  • AlertManager: https://github.com/prometheus/alertmanager
  • BlackBox: https://github.com/prometheus/blackbox_exporter
  • PushGateway: https://github.com/prometheus/pushgateway
  • Exporters: https://prometheus.io/docs/instrumenting/exporters/
  • Grafana: https://grafana.com/
  • VictoriaMetrics: https://victoriametrics.com/
Join our team!
Would you like to be an Engineer, Team Lead or Engineering Manager at Miro? Check out opportunities to join the Engineering team.
Translated from: https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e
