I. Overview
This monitoring and alerting system is built on Prometheus and Grafana. Prometheus collects metrics from the hosts, the containers, and the blockchain network; Grafana handles data visualization and alerting.
1. Prometheus Principles and Architecture
Prometheus works by periodically scraping the state of monitored components over HTTP. The advantage of this model is that any component can be plugged into the monitoring system simply by exposing an HTTP endpoint, which makes Prometheus one of the few monitoring systems well suited to Docker and Kubernetes environments.
The components in the Prometheus architecture diagram are as follows:
Prometheus Server: the core component, responsible for collecting and storing monitoring data. It manages monitoring targets either through static configuration or through dynamic service discovery, and pulls data from those targets. Prometheus Server is also a time-series database: it stores the monitoring data on local disk and exposes its own query language, PromQL, for querying and analyzing it.
Exporter: collects the data, playing a role similar to an agent. The difference is that Prometheus pulls data, so an Exporter exposes the metrics in a standard format over an HTTP endpoint for Prometheus Server to scrape. The community already provides a large number of ready-made exporters, and users can implement their own using the client libraries available for various languages.
Push gateway: used mainly for short-lived jobs that may finish before Prometheus Server has a chance to pull their data. Such jobs push their metrics to the Push gateway, which caches them and relays them to Prometheus.
AlertManager: when an alert fires, Prometheus Server pushes the alert to Alertmanager, which delivers the notification to the configured receivers.
Web UI: Prometheus ships with a simple built-in web console for inspecting configuration and querying metrics. In practice, Prometheus is usually used as a data source for Grafana, where dashboards are built and metrics are viewed; in our monitoring system, Grafana is the main tool for displaying the data.
2. Other Components
As described above, Prometheus collects the data and Grafana displays it. Within Prometheus, exporters do the actual data collection; the monitoring system uses the following ones:
(1) Node Exporter: runs as a container on every host and reports the state of the machine itself, including CPU, memory, disk, disk I/O, and network traffic. To Prometheus a machine is a node, so this exporter reports the state of the current node.
(2) cAdvisor: runs as a container on every host and collects container-level data, such as metrics for each node of the Fabric network.
The alerting part is Grafana's built-in alerting component used together with Prometheus.
II. Monitoring Metrics and Approach
1. Host Metrics
node_boot_time: system boot time
node_cpu*: CPU usage
node_disk*: disk I/O
node_filesystem*: filesystem usage
node_load*: system load
node_memory*: memory usage
node_network*: network bandwidth
go_*: Go runtime metrics of node-exporter itself
process_*: process metrics of the node-exporter process itself
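For example, typical host-level queries built from these metric families look like the following (a sketch assuming a recent node-exporter, where the CPU counter is named node_cpu_seconds_total and the memory metrics carry a _bytes suffix; older versions expose node_cpu and node_memory_MemTotal instead, and the queries must be adjusted accordingly):
# Average CPU utilization (%) per host over the last 5 minutes
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# Memory utilization (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100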
2. Container Metrics
container_cpu*: container CPU usage
container_fs*: container filesystem usage
container_memory*: container memory usage
container_network*: container network usage
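The cAdvisor metrics can be filtered per container through the name label, which cAdvisor sets to the Docker container name. A sketch (peer0.org1.example.com is just an example container name from the byfn network):
# CPU usage rate of one Fabric peer container
rate(container_cpu_usage_seconds_total{name="peer0.org1.example.com"}[1m])
# Current memory usage of the same container
container_memory_usage_bytes{name="peer0.org1.example.com"}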
3. Fabric Network Metrics
1> Orderer metrics
Name | Description |
---|---|
blockcutter_block_fill_duration | The time from first transaction enqueueing to the block being cut in seconds. |
broadcast_enqueue_duration | The time to enqueue a transaction in seconds. |
broadcast_processed_count | The number of transactions processed. |
broadcast_validate_duration | The time to validate a transaction in seconds. |
consensus_etcdraft_cluster_size | Number of nodes in this channel. |
consensus_etcdraft_committed_block_number | The block number of the latest block committed. |
consensus_etcdraft_config_proposals_received | The total number of proposals received for config type transactions. |
consensus_etcdraft_data_persist_duration | The time taken for etcd/raft data to be persisted in storage (in seconds). |
consensus_etcdraft_is_leader | The leadership status of the current node: 1 if it is the leader else 0. |
consensus_etcdraft_leader_changes | The number of leader changes since process start. |
consensus_etcdraft_normal_proposals_received | The total number of proposals received for normal type transactions. |
consensus_etcdraft_proposal_failures | The number of proposal failures. |
consensus_etcdraft_snapshot_block_number | The block number of the latest snapshot. |
consensus_kafka_batch_size | The mean batch size in bytes sent to topics. |
consensus_kafka_compression_ratio | The mean compression ratio (as percentage) for topics. |
consensus_kafka_incoming_byte_rate | Bytes/second read off brokers. |
consensus_kafka_last_offset_persisted | The offset specified in the block metadata of the most recently committed block. |
consensus_kafka_outgoing_byte_rate | Bytes/second written to brokers. |
consensus_kafka_record_send_rate | The number of records per second sent to topics. |
consensus_kafka_records_per_request | The mean number of records sent per request to topics. |
consensus_kafka_request_latency | The mean request latency in ms to brokers. |
consensus_kafka_request_rate | Requests/second sent to brokers. |
consensus_kafka_request_size | The mean request size in bytes to brokers. |
consensus_kafka_response_rate | Responses/second received from brokers. |
consensus_kafka_response_size | The mean response size in bytes from brokers. |
deliver_blocks_sent | The number of blocks sent by the deliver service. |
deliver_requests_completed | The number of deliver requests that have been completed. |
deliver_requests_received | The number of deliver requests that have been received. |
deliver_streams_closed | The number of GRPC streams that have been closed for the deliver service. |
deliver_streams_opened | The number of GRPC streams that have been opened for the deliver service. |
fabric_version | The active version of Fabric. |
grpc_comm_conn_closed | gRPC connections closed. Open minus closed is the active number of connections. |
grpc_comm_conn_opened | gRPC connections opened. Open minus closed is the active number of connections. |
grpc_server_stream_messages_received | The number of stream messages received. |
grpc_server_stream_messages_sent | The number of stream messages sent. |
grpc_server_stream_request_duration | The time to complete a stream request. |
grpc_server_stream_requests_completed | The number of stream requests completed. |
grpc_server_stream_requests_received | The number of stream requests received. |
ledger_blockchain_height | Height of the chain in blocks. |
ledger_blockstorage_commit_time | Time taken in seconds for committing the block to storage. |
logging_entries_checked | Number of log entries checked against the active logging level |
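Once the orderer is being scraped, these metrics can be queried directly in Prometheus or Grafana. Two illustrative queries (a sketch; the instance value must match the orderer target configured in prometheus.yml, and the etcdraft metrics only exist when Raft consensus is used):
# Chain height reported by the orderer
ledger_blockchain_height{instance="orderer.example.com:6443"}
# Which node currently leads the Raft cluster (1 = leader, 0 = follower)
consensus_etcdraft_is_leader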
2> Peer metrics
Name | Description |
---|---|
chaincode_launch_duration | The time to launch a chaincode. |
chaincode_shim_request_duration | The time to complete chaincode shim requests. |
chaincode_shim_requests_completed | The number of chaincode shim requests completed. |
chaincode_shim_requests_received | The number of chaincode shim requests received. |
deliver_blocks_sent | The number of blocks sent by the deliver service. |
deliver_requests_completed | The number of deliver requests that have been completed. |
deliver_requests_received | The number of deliver requests that have been received. |
deliver_streams_closed | The number of GRPC streams that have been closed for the deliver service. |
deliver_streams_opened | The number of GRPC streams that have been opened for the deliver service. |
dockercontroller_chaincode_container_build_duration | The time to build a chaincode image in seconds. |
endorser_proposal_duration | The time to complete a proposal. |
endorser_proposals_received | The number of proposals received. |
endorser_successful_proposals | The number of successful proposals. |
fabric_version | The active version of Fabric. |
gossip_comm_messages_received | Number of messages received |
gossip_comm_messages_sent | Number of messages sent |
gossip_leader_election_leader | Peer is leader (1) or follower (0) |
gossip_membership_total_peers_known | Total known peers |
gossip_payload_buffer_size | Size of the payload buffer |
gossip_privdata_commit_block_duration | Time it takes to commit private data and the corresponding block (in seconds) |
gossip_privdata_fetch_duration | Time it takes to fetch missing private data from peers (in seconds) |
gossip_privdata_list_missing_duration | Time it takes to list the missing private data (in seconds) |
gossip_privdata_purge_duration | Time it takes to purge private data (in seconds) |
gossip_privdata_reconciliation_duration | Time it takes for reconciliation to complete (in seconds) |
gossip_privdata_validation_duration | Time it takes to validate a block (in seconds) |
gossip_state_commit_duration | Time it takes to commit a block in seconds |
gossip_state_height | Current ledger height |
grpc_comm_conn_closed | gRPC connections closed. Open minus closed is the active number of connections. |
grpc_comm_conn_opened | gRPC connections opened. Open minus closed is the active number of connections. |
grpc_server_stream_messages_received | The number of stream messages received. |
grpc_server_stream_messages_sent | The number of stream messages sent. |
grpc_server_stream_request_duration | The time to complete a stream request. |
grpc_server_stream_requests_completed | The number of stream requests completed. |
grpc_server_stream_requests_received | The number of stream requests received. |
grpc_server_unary_request_duration | The time to complete a unary request. |
grpc_server_unary_requests_completed | The number of unary requests completed. |
grpc_server_unary_requests_received | The number of unary requests received. |
ledger_block_processing_time | Time taken in seconds for ledger block processing. |
ledger_blockchain_height | Height of the chain in blocks. |
ledger_blockstorage_and_pvtdata_commit_time | Time taken in seconds for committing the block and private data to storage. |
ledger_blockstorage_commit_time | Time taken in seconds for committing the block to storage. |
ledger_statedb_commit_time | Time taken in seconds for committing block changes to state db. |
ledger_transaction_count | Number of transactions processed. |
logging_entries_checked | Number of log entries checked against the active logging level |
logging_entries_written | Number of log entries that are written |
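A few illustrative queries over the peer metrics (a sketch; rate() applies because the proposal metrics are counters):
# Endorsement proposals received per second, per peer, over the last 5 minutes
rate(endorser_proposals_received[5m])
# Fraction of proposals endorsed successfully
sum(rate(endorser_successful_proposals[5m])) / sum(rate(endorser_proposals_received[5m]))
# Ledger height as seen by each peer
ledger_blockchain_height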
III. Installing and Deploying the Monitoring Components
All components are deployed with docker-compose.
1. Add the docker-compose.yml file
It covers the deployment of the prometheus, grafana, node-exporter, and cadvisor containers, including their volumes, port mappings, and the configuration files they depend on:
version: '2'

networks:
  monitor:
    external:
      name: net_byfn

services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    hostname: prometheus
    restart: always
    volumes:
      - /home/centos/config/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    networks:
      - monitor

  grafana:
    image: grafana/grafana
    container_name: grafana
    hostname: grafana
    restart: always
    ports:
      - "3000:3000"
    networks:
      - monitor

  node-exporter:
    image: prom/node-exporter
    container_name: exporter
    hostname: node-exporter
    restart: always
    volumes:
      - /proc/:/host/proc
      - /sys/:/host/sys
      - /:/rootfs
    ports:
      - "9100:9100"
    networks:
      - monitor

  cadvisor:
    image: google/cadvisor:latest
    command: "--enable_load_reader=true"
    container_name: cadvisor
    hostname: cadvisor
    restart: always
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
    networks:
      - monitor
2. Add the files that docker-compose.yml depends on
prometheus.yml is the configuration file mounted into the prometheus container. It defines the job names (job_name) and the monitored targets (targets):
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped
  # from this config.
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090','exporter:9100','cadvisor:8080']
  - job_name: 'fabric'
    static_configs:
      - targets: ['peer0.org1.example.com:7443']
      - targets: ['peer1.org1.example.com:8443']
      - targets: ['peer0.org2.example.com:9443']
      - targets: ['peer1.org2.example.com:10443']
      - targets: ['orderer.example.com:6443']
If other components depend on configuration files of their own, the corresponding files need to be added as well, for example the alert email settings used by the alerting module.
3. Start the containers with docker-compose
# Start the containers:
docker-compose -f docker-compose.yml up -d
# Stop and remove the containers:
docker-compose -f docker-compose.yml down
# Restart a single container:
docker restart <container_id>
The running containers are shown below.
4. Configuring and Collecting Fabric Metrics
To monitor the Fabric network itself, the peer and orderer configurations need two changes. First, set the metrics provider to prometheus. Second, because Prometheus uses a pull model, the peers and the orderer must expose their operations ports.
1> Set the peer's metrics provider to prometheus
vi github.com/hyperledger/fabric/sampleconfig/core.yaml
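The relevant sections of core.yaml look roughly like the following (a sketch based on the Fabric 1.4 sample configuration; the listen address and port are assumptions and should match the peer targets defined in prometheus.yml, and can also be overridden per container as shown in step 3> below):
operations:
  # listen on all interfaces so Prometheus can reach the endpoint from outside the container
  listenAddress: 0.0.0.0:9443
  tls:
    enabled: false
metrics:
  # change the provider from "disabled" (or "statsd") to "prometheus"
  provider: prometheus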
2> Set the orderer's metrics provider to prometheus
vi github.com/hyperledger/fabric/sampleconfig/orderer.yaml
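The corresponding sections of orderer.yaml (again a sketch; the port is an assumption and should match the orderer target in prometheus.yml):
Operations:
  # listen on all interfaces so Prometheus can reach the endpoint from outside the container
  ListenAddress: 0.0.0.0:6443
Metrics:
  # change the provider to "prometheus"
  Provider: prometheus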
3> Expose the operations ports of the peers and the orderer
/fabric/fabric-samples/first-network/base/docker-compose-base.yaml
The other peers are configured in the same way; just make sure the ports do not clash.
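For example, the additions for peer0.org1 in docker-compose-base.yaml might look like the following (a sketch: the environment variables CORE_METRICS_PROVIDER and CORE_OPERATIONS_LISTENADDRESS are the standard overrides for the core.yaml values above, and the port numbers follow the targets in prometheus.yml; for the orderer the equivalent variables are ORDERER_METRICS_PROVIDER and ORDERER_OPERATIONS_LISTENADDRESS):
peer0.org1.example.com:
  environment:
    - CORE_METRICS_PROVIDER=prometheus
    - CORE_OPERATIONS_LISTENADDRESS=0.0.0.0:7443
  ports:
    - "7443:7443"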
5. Start the Fabric Network
./byfn.sh up
If the network is already running, the configuration changes require bringing the byfn network down first and then starting it again:
./byfn.sh down
6. Check that Prometheus started correctly
Open [machine IP:port] in a browser to reach the Prometheus UI, where the machine IP is the host running Prometheus and the port is the self-monitoring port configured above. If a target's status shows "UP", that target is being scraped successfully.
X.X.X.X:9090/targets
Both Prometheus jobs should now be up.
7. Check the data collected by the other components, taking cadvisor as an example; the others are similar:
X.X.X.X:8080/metrics
8. Check Grafana
# Access Grafana
http://X.X.X.X:3000/
# Initial username/password: admin/admin
The data is displayed as shown in the figure.
IV. Using Grafana
What is Grafana?
Grafana is an open-source visualization tool written in Go that has become popular in recent years. It supports Prometheus natively as well as many other data sources, including Elasticsearch, InfluxDB, MySQL, and OpenTSDB, and it gives Prometheus a fairly complete visualization platform. The following explains how to use Grafana to display the monitoring data.
1. Log in
Open http://X.X.X.X:3000/ in a browser to reach the Grafana service. The initial username/password is admin/admin.
The main page looks like this:
2. Add the Prometheus data source
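The data source can be added through the Grafana UI (Configuration → Data Sources → Add data source → Prometheus) by pointing the URL at the Prometheus container. Alternatively it can be provisioned with a file placed under /etc/grafana/provisioning/datasources/ — a minimal sketch, assuming Grafana shares the net_byfn network so the container name prometheus resolves:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # container name of the Prometheus service from docker-compose.yml
    url: http://prometheus:9090
    isDefault: true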
3. Import a dashboard template
Dashboard 893 is a commonly used template; after importing it you get a basic dashboard layout.
4. How to create a panel
Each of the monitored items above is a panel, so how do we create a panel for the data we want to monitor ourselves?
A panel displays metrics based on a PromQL query, which can also be filtered by job or instance. This works for host data as well as container and Fabric data; just enter the expression in the panel's Metrics field.
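For example, a panel tracking the CPU usage of all Fabric containers, or the ledger height exposed by the fabric job, could use queries like these (a sketch; the name regex simply matches the byfn container names):
# CPU usage of the Fabric containers, collected by cadvisor
rate(container_cpu_usage_seconds_total{name=~"peer.*|orderer.*"}[1m])
# A Fabric metric, filtered by the job defined in prometheus.yml
ledger_blockchain_height{job="fabric"}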
5. Add a dashboard-wide filter
For example, suppose we need to filter the CPU and memory data of a single container. The steps are as follows, with a sketch after the list:
1. Add the variable in the dashboard settings.
2. Reference the new global container variable in the panel queries.
3. The dashboard now shows a container filter; selecting a container displays all of that container's metrics.
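A sketch of such a dashboard variable and how a panel uses it (the variable name container is an assumption; label_values() is Grafana's helper for listing the values of a label):
# Variable (Dashboard settings → Variables → New, type "Query"):
#   Name:  container
#   Query: label_values(container_last_seen, name)
# Panel queries then filter on the current selection:
rate(container_cpu_usage_seconds_total{name=~"$container"}[1m])
container_memory_usage_bytes{name=~"$container"}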
V. Alerting
Grafana lets you define alerts directly on the data, set thresholds visually, and deliver notifications through DingTalk, email, and other channels. Most importantly, alert rules can be defined intuitively, are evaluated continuously, and send notifications when they fire. The following configures email notification as an example.
1. Configure alerts
1> Edit the alerting-related settings in grafana.ini; the file can be found at the location where the grafana container's filesystem is mounted on the host:
cd /var/lib/docker/overlay2/4464e288198b02c9b3cf89fa9bdafe2716e6aea50f59a671787949a4c6a03bdf/merged/etc/grafana
Configure the SMTP mail server.
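The [smtp] section of grafana.ini needs to be enabled and pointed at the mail server; a sketch with placeholder values (host, account, and password are assumptions to be replaced with real ones):
[smtp]
enabled = true
host = smtp.example.com:465
user = alert-sender@example.com
password = your-smtp-password
from_address = alert-sender@example.com
from_name = Grafana
After editing the file, restart the grafana container (docker restart grafana) so the new settings take effect.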
2> In Grafana, add a notification channel with the email address that should receive the alerts.
3> Add an alert to the panel. Only threshold alerts are supported here, and an alert can only be attached to a concrete instance rather than a templated query, which is a limitation of Grafana alerting.
At this point, all of the alert configuration is complete.
2. Alert notifications
When the data meets the alert condition, an email is sent to the configured address; the email shows the alert message and the value that triggered it.