⌘K
Change language Switch ThemeSign In
Narrow Mode
Clickhouse 迁移数据到 Doris,有哪些坑? ============================
d dbaplus社群 @dbaplus社群
One Sentence Summary
This article shares practical experience in migrating tens of billions of data records from Clickhouse to Doris, focusing on comparing the pros, cons, and pitfalls of three solutions: Catalog direct connection, file import, and Spark engine.
Summary
This article thoroughly explores three data migration paths to Doris in the context of Clickhouse cluster decommissioning. Through practical experience, the author found that while Doris's Clickhouse Catalog solution is the simplest to operate, it frequently triggers JDBC exceptions when handling large local tables and sharded tables, leading to insufficient stability. In contrast, exporting Clickhouse data as TSV files (to avoid comma conflicts) and then importing it using Doris's Stream Load proved to be a balanced solution that combines efficiency and success rate. The author emphasizes that for heterogeneous database migration, there is no one-size-fits-all tool. A combination of multiple strategies (e.g., direct connection for small tables, offline import for large tables, and Spark for complex tables) must be employed based on table size and field complexity to ensure the task's ultimate success.
Main Points
* 1. Doris's Clickhouse Catalog solution faces stability bottlenecks when handling large-scale or distributed tables.Although the Catalog solution supports direct SQL migration, practical tests show it's only effective for small-scale local tables. It frequently reports errors due to JDBC driver issues when dealing with local tables with tens of millions of rows or sharded tables of any size. * 2. Using TSV format with Stream Load is currently the most cost-effective and robust migration method.Compared to CSV, TSV uses tab delimiters to effectively prevent data corruption caused by commas within fields; Stream Load can import 14 million rows of data in less than 20 seconds, demonstrating extremely high efficiency. * 3. Heterogeneous database migration requires a conservative approach involving a 'combination of multiple strategies' based on data characteristics.No single solution can solve all problems. In practice, one should first manually align the table structure on the target side, then choose between direct connection, file import, or Spark engine based on the size and complexity of each table.
Metadata
AI Score
82
Website mp.weixin.qq.com
Published At Yesterday
Length 1941 words (about 8 min)
Sign in to use highlight and note-taking features for a better reading experience. Sign in now
Anryg(安瑞哥)2026-03-13 07:15 广东
以下文章来源于:安瑞哥是码农 安瑞哥是码农 一只帮你撕开技术遮羞布,还原技术真相的资深码农!
想要迁移后的效果好,就一定要采用多种策略组合!
最近,有个 Clickhouse 集群因为项目已经到期交付,需要把服务器资源给腾退出来,但考虑到里面存了很多「将来可能还会用上」的数据(约100亿规模),就想着把它给备份出来。
之前的项目,核心 DB 用的都是 CK ,说实话,已经有些玩腻了,这一次,我决定把这部分数据,给迁到 Doris ,然后在它身上,做大做强。
(PS:当前 CK 版本为 25.x,Doris 版本为 2.1.2) 一、 迁移方式
根据我以往的经验,以及问过 AI 之后,可行的方案有 3 个: 方案 1
利用 Doris 的 clickhouse catalog + insert into ... select * from 的方式; 方案 2
先把 CK 每张表的数据给导出成文件,再利用 Doris 支持的 Broker Load 或者 Stream Load,把数据写入到 Doris 表里; 方案 3
利用 Spark 引擎,读 CK 里面的每一张表数据,然后写入到 Doris 目标表里。
其中方案1、2是 AI 跟我都想到的,根据之前的实测经验,方案1最简单且高效,但 bug 最多。
方案2需要分两步走,麻烦一些,但效率也不错,bug 比方案1更少。
方案3是我认为最靠谱,最健壮的方式,也是被我验证了无数次的方案,但因为需要编码,跟额外的计算资源,所以最麻烦。
那么接下来,我们就逐个验证这些迁移方式(验证1跟2),看哪个最靠谱? 二、验证方案 1
先说结论:不靠谱。
这里的不靠谱,并不完全不行,而是——对于 CK 的表类型来说,「小本地表」可以,但所有的「分片表」,以及「大本地表」就不行了。
具体玩法是这样的:
首先,在 Doris 创建 CK 的catalog:
CREATE CATALOG jdbc_clickhouse01 PROPERTIES (
"user" = "default",
"type" = "jdbc",
"password" = "***",
"jdbc_url" = "jdbc:clickhouse://192.168.xxx.xxx:8123",
"driver_url" = "clickhouse-jdbc-0.4.6-all.jar",
"driver_class" = "com.clickhouse.jdbc.ClickHouseDriver"
);
这里需要注意的是,这个 CK 的 jar 包,如果不是当场下载(服务器不能连外网时),就需要你提前把这个包下载下来后,放到 Doris 每台机器的 /usr/share/java 目录下。
然后,切换 catalog ,就可以看到对应的表了。
本来以为这样,就可以很简单的通过 insert into... select * from... 的方式把数据给迁移过去。
然而,在实际验证的时候,这种情况只对「本地小表」有效。
具体多小呢?来看这个:
mysql> selectcount(*) from local_small_table01;
+----------+
| count(*) |
+----------+
| 225 |
+----------+
1 row inset (0.20 sec)
mysql> insert into internal.doris_db.local_small_table01 select * from jdbc_clickhouse01.ck_db.target_table01;
Query OK, 225 rows affected (0.35 sec)
{'label':'insert_24c4e07f37914f96_8bcb5272c042236b', 'status':'VISIBLE', 'txnId':'13429378'}
);
这是可以的。
但同样是本地表,如果数据量再大一些,比如千万级别,就不行了。
关键,连查询都不行:
mysql> select count(*) from jdbc_clickhouse01.ck_db.local_big_table02;
ERROR 1105 (HY000): errCode = 2, detailMessage = (192.168.xxx.xxx)[CANCELLED]UdfRuntimeException: jdbc get block address:
CAUSED BY: IndexOutOfBoundsException: Index: 0, Size: 0
至于这个错什么意思,以我现在的水平,确实看不懂(这个表都是些简单字段类型)。
而至于「分片表」,甭管数据量大小,都不行,报的错跟上面一样。
所以目前为止,我只能拿它迁移了几张小表(具体的数据量上限,我没来得及测)。 三、验证方案 2
相比方案 1,这个方案要复杂一些,需要分为 2 步。
先把 CK 的目标表导出成文件。
CK 官网支持的导出文件格式有:CSV、TSV、JSON、Parquet 等。
CK 支持的导出数据格式
但因为导出的数据文件,最终还需要再次写入到 Doris,所以这个格式,还必须得是 Doris 支持导入的。
为了方便,这里 优先用 Doris 的 stream load。
Doris 支持的导入数据格式
虽然支持的也比较丰富,但经过我的实测,最简单,性价比最高的导出导入方式是这样的。
CK 端:采用 TSV 的导出方式——也就是 tab 键(\t)分割的 CSV。
比如像这样:
clickhouse-client --query="SELECT * FROM ck_db.ck_big_table FORMAT TSV" > /tmp/ck_big_table.tsv
然后,把这个文件,统一挪到 Doris 的客户端节点。
Doris 端:指定分隔符为 \t,采用 CSV 的导入方式。
比如像这样:
[root@hdp01 ~]# curl --location-trusted -u doris_user:**** -H "label:label01" -H "column_separator:\t" -T ck_big_table.tsv http://192.168.xxx.xxx:8030/api/doris_db/ck_big_table/_stream_load
{
"TxnId": 13429556,
"Label": "label01",
"Comment": "",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 1406520,
"NumberLoadedRows": 1406520,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 1290174219,
"LoadTimeMs": 11456,
"BeginTxnTimeMs": 1,
"StreamLoadPutTimeMs": 10,
"ReadDataTimeMs": 7769,
"WriteDataTimeMs": 11165,
"CommitAndPublishTimeMs": 279
}
这里之所以没用「,」号分割(正规的 csv),在于某个字段里面的数据,本身就有很多逗号,会导致数据入库的时候产生错乱。
至于为什么没有采用 json 或者其他格式,原因在于——需要在命令行额外添加对应表的 schema,太麻烦(其实也说明 Doris 这块没做好)。
从最终入库效果来看,暂时没发现问题。
而至于效率,无论是 CK 的导出,还是 Doris 的导入,都非常快。
我大概测了一下,1.4kw 的数据量,CK 的导出时间为 2 秒不到。
而 Doris 利用 stream load 的导入,也只有不到 20 秒。 四、最后
可能有同学会觉得,这样一张一张蚂蚁搬家似的迁移,是不是有点繁琐?
目前对于这种「异构」DB 之间的迁移来说,想要迁移后的效果好,以我的经验来看,确实没有特别高效的,现成的办法(除非你定制开发一个)。
比较保守的做法,就只能是先在 Doris 端,根据 CK 的表结构跟字段类型特点,挨个建对应的表。
然后,再通过 CK catalog + Doris stream load 组合拳的方式,把每张大大小小的表数据,给折腾进去(对于部分复杂字段的表,可能后续还得用上 Spark)。
还是那句话,理论上说行的,真刀真枪玩起来的时候,不一定就真的行,对于这种生产库的数据迁移问题,我就没见过谁,可以用一种「既高效、又简单、还优雅」的方式能一招鲜吃遍天的,大家都是通过「多种策略组合」才能最终把问题搞定。
最后,对于这个 CK 迁移 Doris 的方案,你怎么看?
作者丨Anryg(安瑞哥)
来源丨公众号:安瑞哥是码农(ID:gh_c12dc29ae2e7)
dbaplus社群欢迎广大技术人员投稿,投稿邮箱:editor@dbaplus.cn
d dbaplus社群 @dbaplus社群
One Sentence Summary
This article shares practical experience in migrating tens of billions of data records from Clickhouse to Doris, focusing on comparing the pros, cons, and pitfalls of three solutions: Catalog direct connection, file import, and Spark engine.
Summary
This article thoroughly explores three data migration paths to Doris in the context of Clickhouse cluster decommissioning. Through practical experience, the author found that while Doris's Clickhouse Catalog solution is the simplest to operate, it frequently triggers JDBC exceptions when handling large local tables and sharded tables, leading to insufficient stability. In contrast, exporting Clickhouse data as TSV files (to avoid comma conflicts) and then importing it using Doris's Stream Load proved to be a balanced solution that combines efficiency and success rate. The author emphasizes that for heterogeneous database migration, there is no one-size-fits-all tool. A combination of multiple strategies (e.g., direct connection for small tables, offline import for large tables, and Spark for complex tables) must be employed based on table size and field complexity to ensure the task's ultimate success.
Main Points
* 1. Doris's Clickhouse Catalog solution faces stability bottlenecks when handling large-scale or distributed tables.
Although the Catalog solution supports direct SQL migration, practical tests show it's only effective for small-scale local tables. It frequently reports errors due to JDBC driver issues when dealing with local tables with tens of millions of rows or sharded tables of any size.
* 2. Using TSV format with Stream Load is currently the most cost-effective and robust migration method.
Compared to CSV, TSV uses tab delimiters to effectively prevent data corruption caused by commas within fields; Stream Load can import 14 million rows of data in less than 20 seconds, demonstrating extremely high efficiency.
* 3. Heterogeneous database migration requires a conservative approach involving a 'combination of multiple strategies' based on data characteristics.
No single solution can solve all problems. In practice, one should first manually align the table structure on the target side, then choose between direct connection, file import, or Spark engine based on the size and complexity of each table.
Key Quotes
* For Clickhouse table types, 'small local tables' work, but all 'sharded tables' and 'large local tables' do not. * The reason for not using comma separation (standard CSV) is that data within certain fields already contains many commas, which would lead to data corruption during import. * I tested it, and for 14 million data records, Clickhouse's export time was less than 2 seconds. Doris's import using Stream Load also took less than 20 seconds. * For production database data migration issues like this, I've never seen anyone use a 'single, universally effective solution' that is both efficient, simple, and elegant.
AI Score
82
Website mp.weixin.qq.com
Published At Yesterday
Length 1941 words (about 8 min)
Tags
Clickhouse
Apache Doris
Data Migration
Database Operations
Stream Load
Related Articles
* 3 Million Lines of Code to 300K: A Retrospective on Tencent News' Recommendation Architecture Refactoring * Message Queue Backlog in the Millions: Beyond Scaling Out, What Are Other Solutions? * Demystifying Milvus Vector Indexes: A Guide for Traditional DBAs * Stop Blindly Adding Indexes! Master These 83 Scenarios to Truly Understand SQL Optimization * 1.5 Million People Flee ChatGPT Overnight: A Must-Have Relocation Guide for the AI Era * 15 Key Trade-offs in System Design * Stripe's Zero-Downtime Data Movement Platform Migrates Petabyte-Scale Data with Millisecond-Fast Traffic Cutover HomeArticlesPodcastsVideosTweets
Pitfalls of Migrating Data from Clickhouse to Doris | Bes... ===============