In-depth analysis of OpenStack service monitoring
1. In-depth monitoring for Openstack services
George Mihaiescu, Senior Cloud Architect
Jared Baker, Cloud Specialist
2. The infrastructure team
George Mihaiescu
- Cloud architect for the Cancer Genome Collaboratory
- 7 years of Openstack experience
- First deployment: Cactus
- First conference: Boston 2011
- Openstack speaker at the Barcelona, Boston and Vancouver conferences
Jared Baker
- Cloud specialist for the Cancer Genome Collaboratory
- 2 years of Openstack experience, 10 years of MSP experience
- First deployment: Liberty
- First conference (and first talk as a speaker): Boston 2017
3. Ontario Institute for Cancer Research (OICR)
- Largest cancer research institute in Canada, funded by the government of Ontario
- Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees
4. Cancer Genome Collaboratory: project goals and motivation
- Cloud computing environment built for biomedical research by OICR and funded by Government of Canada grants
- Enables large-scale cancer research on the world’s largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC)
- Entirely built using open-source software like Openstack and Ceph
- Compute infrastructure goal: provide 3,000 cores and 15 PB of storage
- A system for cost recovery
5. No-frills design
- Use high-density commodity hardware to reduce physical footprint and related overhead
- Use open-source software and tools
- Prefer copper over fiber for network connectivity
- Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or “nice to have” features
6. Hardware architecture: compute nodes
7. Hardware architecture: Ceph storage nodes
8. Openstack controllers
- Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives)
- Separate partitions for the OS, Ceph mon and MySQL
- Haproxy (SSL termination with ECC certificates) and Keepalived
- 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash (see the bonding sketch after this slide)
- Neutron + GRE, HA routers, no DVR
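The bonding setup can be sketched with iproute2 as below; the interface names (eth0-eth3, bond0) are placeholders rather than the Collaboratory's actual names, and a production deployment would normally persist this in the distribution's network configuration instead of running ad-hoc commands:

# Create an 802.3ad (LACP) bond that hashes traffic on layer 3+4 headers
ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4 miimon 100
# Enslave the four 10 GbE interfaces (they must be down before joining the bond)
for nic in eth0 eth1 eth2 eth3; do
    ip link set "$nic" down
    ip link set "$nic" master bond0
done
ip link set bond0 up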
9. Networking
- Ruckus ICX 7750-48C top-of-rack switches configured in a stacked ring topology
- 6 x 40 Gb Twinax cables between the racks, providing 240 Gbps of non-blocking, redundant connectivity (2:1 oversubscription ratio)
10. Capacity vs. extreme performance
11. Upgrades
12. Software: entirely open source
13. Rally: end-to-end tests
A Rally test runs every hour and performs an end-to-end check (a task sketch follows below):
- Starts a VM
- Assigns a floating IP
- Connects over SSH
- Pings an external host five times
- Alerts if the check fails, takes too long to complete, or packet loss is greater than 40%
It sends runtime info to Graphite for long-term graphing, and Grafana alerts us if the average runtime is above a threshold.
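The hourly check itself is a custom Rally scenario (see slide 15), but a rough approximation using the stock VMTasks.boot_runcommand_delete scenario looks like the sketch below; the image, flavor, network and username values are placeholders, not the Collaboratory's actual settings:

# Write a Rally task that boots a VM, attaches a floating IP, SSHes in and pings
cat > e2e-check.json <<'EOF'
{
  "VMTasks.boot_runcommand_delete": [
    {
      "args": {
        "flavor": {"name": "m1.small"},
        "image": {"name": "ubuntu-16.04"},
        "username": "ubuntu",
        "floating_network": "public",
        "command": {"script_inline": "ping -c 5 8.8.8.8", "interpreter": "/bin/sh"}
      },
      "runner": {"type": "constant", "times": 1, "concurrency": 1}
    }
  ]
}
EOF
rally task start e2e-check.json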
14. Rally: RBD volume performance test
Another Rally check monitors RBD volume (Ceph-based) write performance over time (a sketch of the write script follows below):
- it boots an instance from a volume
- it assigns a floating IP
- it connects over SSH
- it runs a script that writes a 10 GB file three times
- it captures the average IO throughput at the end
- it sends throughput info to Graphite for long-term graphing
- it alerts if the average runtime is above the threshold
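A minimal sketch of such an in-guest write script, assuming the standard Graphite plaintext listener on TCP 2003; the Graphite host name, metric path, mount point and the use of nc (with the Debian-style -q flag) are all assumptions rather than details from the deck:

GRAPHITE_HOST=graphite.example.org
TESTFILE=/mnt/volume/ddtest
total=0
for pass in 1 2 3; do
    start=$(date +%s)
    # write a 10 GB file with direct I/O and a final fsync
    dd if=/dev/zero of="$TESTFILE" bs=1M count=10240 oflag=direct conv=fsync
    end=$(date +%s)
    total=$(( total + 10240 / (end - start) ))   # MB/s for this pass
    rm -f "$TESTFILE"
done
avg=$(( total / 3 ))
# Graphite's plaintext protocol: "<metric> <value> <timestamp>"
echo "rally.rbd_write_test.avg_mbps $avg $(date +%s)" | nc -q 1 "$GRAPHITE_HOST" 2003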
15. Rally: custom checks
16. Rally smoke tests & load tests
17. Zabbix and Grafana
18. Zabbix and Grafana
19. Dockerized monitoring stack
We run a number of tools in containers (a minimal run sketch follows below):
- Sflowtrend
- Prometheus
- Graphite
- Collectd
- Grafana
- Ceph_exporter
- Elasticsearch
- Logstash
- Kibana
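As an illustration, two of these tools can be started as containers with the upstream images and their default ports (prom/prometheus on 9090, grafana/grafana on 3000); the volume paths are placeholders and the Collaboratory's actual container setup may differ:

# Prometheus with a host-provided configuration file
docker run -d --name prometheus \
    -p 9090:9090 \
    -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus

# Grafana with a named volume for its dashboards and data
docker run -d --name grafana \
    -p 3000:3000 \
    -v grafana-data:/var/lib/grafana \
    grafana/grafana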
20. Ceph monitoring: IOPS
21. Ceph monitoring: performance & integrity
22. Openstack capacity usage
23. Sflowtrend
24. Zabbix
- 200+ hosts
- 38,000+ items
- 15,000+ triggers
- Performant, reliable, customizable
https://github.com/CancerCollaboratory/infrastructure
25. Zabbix: the Zabbix agent (client)
Agent-side items (a few example keys are sketched below):
- CPU
- Disk I/O
- Memory
- Filesystem
- Security
- Services running
- HW RAID card
- Fans, temperature, power supply status
- PDU power usage
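Several of these areas map to built-in agent item keys, shown here queried with zabbix_get for illustration; the host name is a placeholder, and the hardware items (RAID card, fans, PDU power) rely on custom UserParameters rather than built-in keys:

zabbix_get -s compute-01 -k 'system.cpu.util[,iowait]'    # CPU time spent in I/O wait
zabbix_get -s compute-01 -k 'vm.memory.size[available]'   # available memory in bytes
zabbix_get -s compute-01 -k 'vfs.fs.size[/,pused]'        # root filesystem, percent used
zabbix_get -s compute-01 -k 'proc.num[nova-compute]'      # is the service running?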
26. Zabbix: custom checks
- When security updates are available (see the sketch after this slide)
- When new cloud images are released
- Number of IPs banned by fail2ban
- Iptables rules across all controllers are in sync
- Openvswitch ports tagged with VLAN 4095 (bad)
- Number of Cinder volumes != number of RBD volumes
- Aggregate memory use per process type (e.g. nova-api, radosgw, etc.)
- Compute nodes have the “neutron-openvswi-sg-chain”

The Cinder/RBD comparison collects the volume IDs from both sides and keeps only the unmatched ones:
# collect all Cinder volume IDs (truncating the scratch file first)
openstack volume list --all -f value -c ID > /tmp/rbdcindervolcompare
# add all RBD images from the "volumes" pool, stripping the "volume-" prefix
rbd -p volumes ls | sed "s/volume-//" >> /tmp/rbdcindervolcompare
# IDs that appear only once exist on one side but not the other
sort /tmp/rbdcindervolcompare | uniq -u
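Checks like the security-update and fail2ban counters are usually one-liners exposed to the agent as UserParameters; a hedged sketch, assuming Ubuntu's update-notifier apt-check helper and a fail2ban jail named sshd (neither detail comes from the deck):

# Pending security updates: apt-check prints "updates;security_updates" on stderr
/usr/lib/update-notifier/apt-check 2>&1 | cut -d';' -f2

# IPs currently banned by fail2ban in the sshd jail
fail2ban-client status sshd | awk '/Currently banned:/ {print $NF}'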
27. Zabbix: Openstack APIs
Multiple checks per API (a simple external probe is sketched below):
- Is the process running?
- Is the port listening?
- Internal checks (from each controller)
- External checks (from the monitoring server)
- Memory usage aggregated per process type
- Response time, number and type of API calls
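A simple external probe of one API (Keystone on its default port 5000) can be done with curl, recording both the HTTP status and the total response time; "controller-vip" is a placeholder for the HAProxy virtual IP, and the other APIs can be probed the same way on their default ports (Nova 8774, Glance 9292, Neutron 9696, Cinder 8776):

curl -s -o /dev/null -w '%{http_code} %{time_total}\n' https://controller-vip:5000/v3/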
28. Zabbix: Openstack services memory usage
29. Zabbix: Neutron router traffic