Low Cost Transactional and Analytics With MySQL and Clickhouse
Do you want MySQL protocol compatibility together with near real-time dashboard analytics?
We have been hoarding data in the hope of extracting meaningful information from it and making informed business decisions. However, we meet (or fail to meet) this need at ever-increasing cost. While there are many open-source roll-your-own options, they can be operationally expensive and sometimes a maintenance burden, not to mention their individual drawbacks. If you have been storing your data in MySQL and running your queries on the same set of servers, worry not: Clickhouse may be exactly what you need. In this talk we discuss how to deploy, design, and maintain a Clickhouse analytics platform that continuously reads data from your MySQL servers in near real time, depending on whether you have transformations or complex schemas.
1. Low Cost Transactional and Analytics with MySQL + Clickhouse: Have Your Cake and Eat It Too
2. Clickhouse Overview
... intentionally left blank ...
3. Why Clickhouse
Query language is very familiar to MySQL users
Means low barrier of entry
Lean design, highly parallel but not too resource greedy
https://clickhouse.yandex/benchmark.html
4. Why Not Clickhouse
Highly dynamic data, especially data that spans multiple partitions
Highly complex relationships between multiple large tables
Multiple JOINs were not possible at the time
5. Our Use Case
Requirement: <15 mins dashboard stats latency
From MySQL: db.r4.8xlarge (~$3.84/hr)
Existing dashboard queries took 30s and upwards
Heavy caching, async
4+ RDS instances
6. Our Use Case
To Clickhouse: single i3.8xlarge (~$2.496/hr)
AVG(5s) dashboard queries
15 queries, 120 entities
~5 mins dashboard lag
7. How: Replicating to Clickhouse
8. Method #1
https://www.altinity.com/blog/2018/6/30/realtime-mysql-clickhouse-replication-in-practice
UPDATE/DELETE ignored
9. Method #2
Native MySQL connection:

INSERT INTO clickhouse_table
SELECT * FROM mysql('host', 'db', 'table', 'user', 'password')
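A slightly fuller sketch of this method, with the host, schema, credentials, and date filter as illustrative assumptions (simple WHERE comparisons are generally executed on the MySQL side, which helps when only recent rows are needed):

  -- A sketch of method #2; host, schema, and credentials are illustrative.
  INSERT INTO clickhouse_table
  SELECT *
  FROM mysql('mysql-host:3306', 'db', 'mysql_table', 'repl_user', 'repl_password')
  WHERE created_at >= '2018-06-01 00:00:00';  -- simple comparisons are pushed down to MySQL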
10. Method #3: Roll our own
Heavy UPDATE/DELETE; Clickhouse's UPDATE/DELETE implementation was not available at the time and has not been tested
Takes most of the concepts from method #1
Use partitions to update data
Do not rely on consistent partitions, but partitions can easily be corrected as needed
Tables independently replicated
11. Initial Table Import
Constraints we can live with:
Map MySQL table to Clickhouse definition
New columns will not be included until rebuild
No replication, but multiple Clickhouse nodes
Table rebuild requires a static replica
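A minimal sketch of such a mapping, assuming a hypothetical reflog table similar to the one queried later in the deck; column names and types are illustrative, not taken from the talk:

  -- MySQL side (illustrative):
  --   id BIGINT UNSIGNED, entity_id INT UNSIGNED, user_id BIGINT UNSIGNED,
  --   created_at DATETIME, status VARCHAR(32), approved TINYINT, time_amount DOUBLE
  -- One possible Clickhouse definition, partitioned so whole partitions can be re-imported:
  CREATE TABLE reflog
  (
      id          UInt64,
      entity_id   UInt32,
      user_id     UInt64,
      created_at  DateTime,
      status      String,
      approved    UInt8,
      time_amount Float64
  )
  ENGINE = MergeTree()
  PARTITION BY toYYYYMMDD(created_at)
  ORDER BY (entity_id, created_at, id);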
12. Initial Table Import Process
1. Initialize dump worker
Create dummy Clickhouse table
Start dumping table in chunks
13. Initial Table Import Process
2. Keep dumping chunks until done
Load CSV chunks as they complete
Once the dump completes, record binlog coordinates to the metadata server
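A sketch of what one dump chunk plus the final binlog capture could look like on the MySQL side; file path, chunk size, and column list are illustrative assumptions:

  -- Dump one primary-key chunk to CSV (repeat with the next id range until the table is exhausted):
  SELECT id, entity_id, user_id, created_at, status, approved, time_amount
  INTO OUTFILE '/tmp/reflog_chunk_000001.csv'
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  FROM reflog
  WHERE id > 0 AND id <= 100000
  ORDER BY id;

  -- Each completed chunk is loaded into Clickhouse, e.g.:
  --   clickhouse-client --query="INSERT INTO reflog FORMAT CSV" < /tmp/reflog_chunk_000001.csv

  -- Once the dump is done, capture the binlog coordinates to store on the metadata server:
  SHOW MASTER STATUS;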
14. Incremental Import
Separate CDC metadata capture and import worker model
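The slides do not show the metadata layout; purely as a hypothetical illustration of what the two workers could share, something like this would be enough to decouple CDC capture from imports:

  -- Hypothetical metadata table; name and columns are illustrative only.
  CREATE TABLE replication_metadata (
      table_name       VARCHAR(64) NOT NULL PRIMARY KEY,
      binlog_file      VARCHAR(64) NOT NULL,
      binlog_pos       BIGINT      NOT NULL,
      dirty_partitions TEXT        NULL,   -- partitions touched since the last import
      last_import_at   DATETIME    NULL
  );
  -- The CDC worker advances binlog_file/binlog_pos and marks partitions dirty;
  -- the import worker re-imports dirty partitions and clears the list.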
15. Incremental Import
One partition at a time; import speed per partition may dictate how the table is partitioned, i.e. daily vs weekly
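For illustration (not from the slides), daily and weekly correspond to different partition key expressions, assuming created_at is a DateTime column:

  -- Daily vs weekly partition keys for the same timestamp:
  SELECT
      toYYYYMMDD(toDateTime('2018-06-30 10:00:00')) AS daily_partition,   -- 20180630
      toMonday(toDateTime('2018-06-30 10:00:00'))   AS weekly_partition;  -- 2018-06-25
  -- i.e. PARTITION BY toYYYYMMDD(created_at) vs PARTITION BY toMonday(created_at)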
16. Incremental Import
Import the partition into a dummy table, then swap it with the partition in the real table
This is not atomic
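A sketch of one way a single cycle could look for a daily partition, using REPLACE PARTITION; table names, host, credentials, and the date range are illustrative, and, as the slide notes, the swap is not atomic for concurrent readers:

  -- Re-import one day into a dummy table with the same structure as reflog:
  ALTER TABLE reflog_import DROP PARTITION 20180630;   -- clear any leftover data for that day

  INSERT INTO reflog_import
  SELECT *
  FROM mysql('mysql-host:3306', 'db', 'reflog', 'repl_user', 'repl_password')
  WHERE created_at >= '2018-06-30 00:00:00' AND created_at < '2018-07-01 00:00:00';

  -- Swap the freshly imported partition into the table the dashboards read:
  ALTER TABLE reflog REPLACE PARTITION 20180630 FROM reflog_import;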
17. Limitations
Phantom reads during the partition swap
Distributed lock, so only one table instance is being updated at a time
Adds import latency; two CH instances worked well
18. Limitations
max_execution_time being breached
max_concurrent_queries
Using async workers to queue and cache dashboard queries
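For context, a hedged sketch of where these two knobs live; the values are illustrative, not the ones used in the talk:

  -- max_execution_time is a per-session / per-profile setting:
  SET max_execution_time = 60;
  -- or per statement:
  SELECT count() FROM reflog SETTINGS max_execution_time = 60;

  -- max_concurrent_queries is a server-wide limit in config.xml, e.g.
  --   <max_concurrent_queries>100</max_concurrent_queries>
  -- which is why dashboard queries are queued and cached by async workers
  -- instead of being fired at Clickhouse directly.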
19. Sample Queries (1)
DISTINCT clause dilemma (MySQL):

SELECT SQL_NO_CACHE
    COUNT(*) AS hit_count,
    COUNT(DISTINCT(reflog.user_id)) AS visitors,
    SUM(reflog.time_amount) AS time_spent,
    AVG(time_amount) AS avg_time
FROM reflog
INNER JOIN page_uri pg ON pg.uri_id = reflog.uri_id
WHERE reflog.entity_id = 396
  AND (reflog.created_at BETWEEN '2017-08-13 05:00:00.000000' AND '2017-09-14 04:59:59.999999')
  AND reflog.status = 'statusA'
  AND (reflog.approved = 1)
  AND pg.entity_id = 396
  AND pg.admin_id = 3275

+-----------+----------+-------------------+--------------------+
| hit_count | visitors | time_spent        | avg_time           |
+-----------+----------+-------------------+--------------------+
|   2827576 |  2077654 | 60283159.65371944 | 21.319730982905302 |
+-----------+----------+-------------------+--------------------+
1 row in set (31.57 sec)
20. Sample Queries (1)
The same query in Clickhouse:

SELECT
    COUNT(*) AS hit_count,
    COUNTDistinct(user_id) AS visitors,
    SUM(time_amount) AS time_spent,
    AVG(time_amount) AS avg_time
FROM reflog
WHERE (approved = 1)
  AND (status = 'statusA')
  AND ((created_at >= 1502600400) AND (created_at <= 1505365199))
  AND (uri_id IN (
      SELECT uri_id
      FROM page_uri
      WHERE (admin_id = 3275 AND entity_id = 396)
  ))
  AND entity_id = 396

┌─hit_count─┬─visitors─┬───────time_spent─┬──────────avg_time─┐
│   2827576 │  2077654 │ 60283159.1976388 │ 21.31973082160791 │
└───────────┴──────────┴──────────────────┴───────────────────┘

1 rows in set. Elapsed: 0.243 sec. Processed 6.09 million rows, 184.37 MB (25.10 million rows/s., 760.03 MB/s.)
21. Sample Queries (2)
Large date range (1 month), low cardinality index constant. We can optimize to death ... but:

SELECT sum(`click_amount`) AS `click_amount`
FROM `reflog`
WHERE (entity_id = 594
  AND created_at >= 1520793000
  AND created_at <= 1523557799
  AND id IN ((
      SELECT `event_id`
      FROM `user_logs`
      WHERE ((campaign_type = 'thisthatlogtype'
              AND log_id IN ((SELECT id FROM `some_log_types` WHERE (entity_id = 594)))
              AND control_group = 0)
             AND (`event_type` = 'login'))))
  AND approved = 1
  AND status = 'statusA');

...
57d1f674f8e75c4319e9a7a88afdd350 -
1 row in set (12 min 30.40 sec)
22. Sample Queries (2)
There is still room for optimization in Clickhouse, but good enough:

SELECT sum(click_amount) AS click_amount
FROM reflog
WHERE (entity_id = 594)
  AND (created_at >= 1520793000)
  AND (created_at <= 1523557799)
  AND (id IN (
      SELECT event_id
      FROM user_logs
      WHERE ((log_type = 'thisthatlogtype')
             AND (log_id IN (SELECT id FROM some_log_types WHERE entity_id = 594))
             AND (control_group = 0))
        AND (event_type = 'login')
  ))
  AND (approved = 1)
  AND (status = 'statusA')

┌───────click_amount─┐
│ 4771.6999979019165 │
└────────────────────┘

1 rows in set. Elapsed: 5.403 sec. Processed 598.31 million rows, 22.94 GB (110.74 million rows/s., 4.25 GB/s.)
23. Rate This Talk
24. Thank you for joining us this year! ... and thanks to our sponsors!
25. Questions?