This article explains how count is computed in a distributed database. If you have ever wondered how count(distinct ...) actually executes across nodes, the walkthrough below should help clear it up. Let's get started!
In a distributed database, computing count(distinct xxx) requires processing the distinct column as follows:
1. Deduplicate locally on each node.
2. Redistribute the deduplicated data by the distinct column (if there are very many distinct values, this step is expensive).
3. Deduplicate again after redistribution.
4. count(xxx) on each node.
5. Sum the per-node counts.
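The five steps above can be sketched in Python (a toy simulation, not actual Greenplum/citus code; the md5-based routing and the node count are illustrative assumptions):

```python
import hashlib

def distributed_count_distinct(shards, n_nodes=4):
    """Simulate count(distinct x) across shards using the 5 steps above."""
    # Step 1: each node deduplicates its local values.
    local_distinct = [set(rows) for rows in shards]

    # Step 2: redistribute the deduplicated values by hash of the value,
    # so every distinct value lands on exactly one node.
    redistributed = [set() for _ in range(n_nodes)]
    for values in local_distinct:
        for v in values:
            target = int(hashlib.md5(str(v).encode()).hexdigest(), 16) % n_nodes
            redistributed[target].add(v)  # Step 3: the set dedups on arrival

    # Step 4: each node counts its values; Step 5: sum the per-node counts.
    return sum(len(values) for values in redistributed)

shards = [[1, 2, 3, 3], [2, 3, 4], [4, 5, 1]]
print(distributed_count_distinct(shards))  # 5 distinct values: {1,2,3,4,5}
```

Because each value hashes to exactly one node after redistribution, the per-node counts can simply be summed without double counting.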
For example, here is a Greenplum execution plan:
```
postgres=# explain analyze select count(distinct c_acctbal) from customer;
                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=182242.41..182242.42 rows=1 width=8)
   Rows out:  1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms.
   ->  Gather Motion 16:1  (slice2; segments: 16)  (cost=53392.85..173982.82 rows=660767 width=8)
         Rows out:  818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms.
         ->  HashAggregate  (cost=53392.85..61652.43 rows=41298 width=8)
               Group By: customer.c_acctbal
               Rows out:  Avg 51177.1 rows x 16 workers.  Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms.
               ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=30266.00..43481.34 rows=41298 width=8)
                     Hash Key: customer.c_acctbal
                     Rows out:  Avg 89865.6 rows x 16 workers at destination.  Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms.
                     ->  HashAggregate  (cost=30266.00..30266.00 rows=41298 width=8)
                           Group By: customer.c_acctbal
                           Rows out:  Avg 89865.6 rows x 16 workers.  Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms.
                           ->  Append-only Columnar Scan on customer  (cost=0.00..22766.00 rows=93750 width=8)
                                 Rows out:  Avg 93750.0 rows x 16 workers.  Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms.
 Slice statistics:
   (slice0)    Executor memory: 387K bytes.
   (slice1)    Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0).
   (slice2)    Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0).
 Statement statistics:
   Memory used: 1280000K bytes
 Optimizer status: legacy query optimizer
 Total runtime:  723.143 ms
(23 rows)
```
And here is a citus example:
```
postgres=# explain analyze select count(distinct bid) from pgbench_accounts ;
                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1)
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=1921 dbname=postgres
               ->  HashAggregate  (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1)
                     Group Key: bid
                     ->  Seq Scan on pgbench_accounts_106812 pgbench_accounts  (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1)
             Planning time: 0.445 ms
             Execution time: 3.781 ms
 Planning time: 1.399 ms
 Execution time: 32.159 ms
(13 rows)
```
For scenarios where an estimate is acceptable, i.e. an exact distinct count is not required, PostgreSQL has a plugin called hll that estimates the number of distinct elements.
Combined with hll, citus can run count(distinct xxx) extremely fast, and it stays fast even when the number of distinct values is very large.
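To see why hll sketches suit distributed counting, here is a minimal HyperLogLog in Python. This is the textbook algorithm, not the hll extension's actual format or constants; the key property it demonstrates is that per-node sketches merge losslessly, so the coordinator never needs the raw values:

```python
import hashlib
import math

class SimpleHLL:
    """A minimal HyperLogLog sketch (textbook version, illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                       # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        h = int(hashlib.sha1(str(value).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Sketches merge by taking register-wise maxima -- this is why
        per-node HLLs can be combined on the coordinator cheaply."""
        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:     # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

# Two "nodes" each sketch their local values; the coordinator merges.
a, b = SimpleHLL(), SimpleHLL()
for i in range(60000):
    a.add(i)
for i in range(40000, 100000):
    b.add(i)
a.merge(b)
print(a.count())  # close to the true distinct count of 100000
```

With 2^10 registers the standard error is about 1.04/sqrt(1024) ≈ 3%, so the merged estimate should land within a few percent of 100000 despite the 20000-value overlap between the two nodes.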
```
SET citus.count_distinct_error_rate to 0.005;
```

Here 0.005 is the error rate (distortion) you are willing to accept.
1. On all nodes (coordinator and workers), install the hll software:
```
yum install -y gcc-c++
cd ~/
git clone https://github.com/citusdata/postgresql-hll
cd postgresql-hll
. /var/lib/pgsql/.bash_profile
USE_PGXS=1 make
USE_PGXS=1 make install
```
2. On all nodes (coordinator and workers), create the extension in every database that needs HLL:
```
su - postgres -c "psql -d postgres -c 'create extension hll;'"
su - postgres -c "psql -d newdb -c 'create extension hll;'"
```
```
create table test (id int primary key, a int, b int, c int);
set citus.shard_count =128;
select create_distributed_table('test', 'id');

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1,1000000000) t(id);
```
```
SET citus.count_distinct_error_rate to 0.005;

newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=0.00..0.00 rows=0 width=0)
   Group Key: remote_scan.worker_column_2
   ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)
         Task Count: 128
         Tasks Shown: One of 128
         ->  Task
               Node: host=172.24.211.224 port=8001 dbname=newdb
               ->  GroupAggregate  (cost=97272.79..105102.29 rows=1000 width=36)
                     Group Key: bid
                     ->  Sort  (cost=97272.79..99227.04 rows=781700 width=4)
                           Sort Key: bid
                           ->  Seq Scan on pgbench_accounts_102008 pgbench_accounts  (cost=0.00..20759.00 rows=781700 width=4)
(12 rows)
```
```
newdb=# set citus.count_distinct_error_rate to 0;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 423.364 ms

postgres=# set citus.count_distinct_error_rate to 0;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2392.709 ms (00:02.393)
```
```
newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct bid) from pgbench_accounts;
 count
-------
  1000
(1 row)
Time: 444.287 ms

postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct a) from test;
 count
-------
    10
(1 row)
Time: 2375.473 ms (00:02.375)
```
```
postgres=# set citus.count_distinct_error_rate to 0;
  count
----------
 10000000
(1 row)
Time: 5826241.205 ms (01:37:06.241)
```
With 128 shards, each node may send up to 1 billion / 128 rows to the coordinator, so the slowness is understandable. On the other hand, the coordinator could deduplicate while receiving (PostgreSQL 11 added parallel gather, merge sort, and similar capabilities that the citus coordinator could borrow); there is no need to wait until all the data has arrived before deduplicating.
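The "deduplicate while receiving" idea can be sketched with sorted per-node streams: if each worker sends its locally deduplicated values in sorted order, the coordinator can merge and count on the fly without buffering everything first (a toy illustration of the approach, not citus code):

```python
import heapq

def streaming_distinct_count(sorted_streams):
    """Merge already-sorted, locally-deduplicated streams from each node
    and count distinct values without materializing the full union."""
    count = 0
    prev = object()  # sentinel that compares unequal to any real value
    for v in heapq.merge(*sorted_streams):
        if v != prev:        # duplicates across nodes arrive adjacently
            count += 1
            prev = v
    return count

streams = [[1, 3, 5, 7], [2, 3, 5, 8], [1, 8, 9]]
print(streaming_distinct_count(streams))  # 7 distinct values
```

Memory stays proportional to the number of streams rather than the number of distinct values, which is exactly what helps when workers ship millions of rows to the coordinator.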
```
postgres=# set citus.count_distinct_error_rate to 0.005;
postgres=# select count(distinct (a,c)) from test;
  count
---------
 9999995
(1 row)
Time: 4468.749 ms (00:04.469)
```
```
newdb=# set citus.count_distinct_error_rate to 0.1;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;
  count
----------
 94778491
(1 row)
Time: 545.301 ms

newdb=# set citus.count_distinct_error_rate to 0.01;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;
   count
-----------
 100293937
(1 row)
Time: 554.333 ms

-- 0.005 is the recommended setting
newdb=# set citus.count_distinct_error_rate to 0.005;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;
   count
-----------
 100136086
(1 row)
Time: 1053.070 ms (00:01.053)

newdb=# set citus.count_distinct_error_rate to 0.001;
newdb=# select count(distinct (aid,bid)) from pgbench_accounts ;
   count
-----------
 100422107
(1 row)
Time: 9287.934 ms (00:09.288)
```
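The timing pattern above follows from how the error rate maps to sketch size: with HyperLogLog the standard error is roughly 1.04/sqrt(m), so halving the error rate quadruples the register count and the merge work. A quick back-of-the-envelope calculation, assuming the textbook formula (the hll extension's internal sizing may differ):

```python
import math

def hll_registers(error_rate):
    """Registers needed for a target standard error: err ≈ 1.04 / sqrt(m),
    so m ≈ (1.04 / err)^2, rounded up to a power of two as HLL requires."""
    m = (1.04 / error_rate) ** 2
    return 1 << math.ceil(math.log2(m))

for err in (0.1, 0.01, 0.005, 0.001):
    print(err, hll_registers(err))
```

Going from 0.01 to 0.001 grows the sketch by more than 100x, which is consistent with the jump from ~0.5 s to ~9 s in the runs above.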
That wraps up how count is computed in a distributed database; hopefully it has cleared up any confusion. Pairing theory with practice is the best way to learn, so go try it out!