主頁 > 知識庫 > postgreSQL 非count方法算記錄數(shù)操作

postgreSQL 非count方法算記錄數(shù)操作

一般方法

select count(1) from table_name;

全量掃描一遍表，記錄越多，查詢速度越慢

新法

PostgreSQL 還真提供了一個(gè)這樣的途徑，那就是系統(tǒng)表 pg_class，這個(gè)系統(tǒng)表里頭，存儲著每個(gè)表的統(tǒng)計(jì)信息，其中 reltuples 就是對應(yīng)的表的統(tǒng)計(jì)行,統(tǒng)計(jì)行的數(shù)據(jù)是pg有個(gè)獨(dú)立進(jìn)程，定期掃描不同的表，收集這些表的統(tǒng)計(jì)信息，保存在系統(tǒng)表里頭。

方法如下：

select 
 reltuples::int as total 
from 
 pg_class 
where 
 relname = 'table_name' 
 and relnamespace = (select oid from pg_namespace where nspname = 'schema');

新方法不是通用的，如果要求特精確還是使用select count(1),如果是類似分頁的，且分頁數(shù)量超過比較多的，也不是要求特別精準(zhǔn)的，這就是一個(gè)好方法！

count(1) over 計(jì)算記錄數(shù)

select count(1) over(), * from table_name;

補(bǔ)充

count 是最常用的聚集函數(shù)之一，看似簡單，其實(shí)還是有坑的，如：

1、count(*)：返回結(jié)果集的行數(shù)，是null也統(tǒng)計(jì)

2、count(1)：和count(*)基本沒區(qū)別，pg92之前都是掃描全表的，pg92之后增加了index only scan一般會變成掃主鍵索引，如果沒有主鍵或者是表的列很多的情況下，count(1)快一些，因?yàn)椴粫紤]表的全部字段

3、count(field)：返回?cái)?shù)據(jù)表中指定字段值不等于null的行數(shù)

拓展：理解 PostgreSQL 的 count 函數(shù)的行為

關(guān)于 count 函數(shù)的使用一直存在爭議，尤其是在 MySQL 中，作為流行度越來越高的 PostgreSQL 是否也有類似的問題呢，我們通過實(shí)踐來理解一下 PostgreSQL 中 count 函數(shù)的行為。

構(gòu)建測試數(shù)據(jù)庫

創(chuàng)建測試數(shù)據(jù)庫，并創(chuàng)建測試表。測試表中有自增 ID、創(chuàng)建時(shí)間、內(nèi)容三個(gè)字段，自增 ID 字段是主鍵。

create database performance_test;

create table test_tbl (id serial primary key, created_at timestamp, content varchar(512));

生成測試數(shù)據(jù)

使用 generate_series 函數(shù)生成自增 ID，使用 now() 函數(shù)生成 created_at 列，對于 content 列，使用了 repeat(md5(random()::text), 10) 生成 10 個(gè) 32 位長度的 md5 字符串。使用下列語句，插入 1000w 條記錄用于測試。

performance_test=# insert into test_tbl select generate_series(1,10000000),now(),repeat(md5(random()::text),10); INSERT 0 10000000 Time: 212184.223 ms (03:32.184)

由 count 語句引發(fā)的思考

默認(rèn)情況下 PostgreSQL 不開啟 SQL 執(zhí)行時(shí)間的顯示，所以需要手動開啟一下，方便后面的測試對比。

\timing on

count(*) 和 count(1) 的性能區(qū)別是經(jīng)常被討論的問題，分別使用 count(*) 和 count(1) 執(zhí)行一次查詢。

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 115090.380 ms (01:55.090)
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 738.502 ms

可以看到兩次查詢的速度差別非常大，count(1) 真的有這么大的性能提升？接下來再次運(yùn)行查詢語句。

performance_test=# select count(*) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 657.831 ms
 
performance_test=# select count(1) from test_tbl;
 count
----------
 10000000
(1 row)
 
Time: 682.157 ms

可以看到第一次查詢時(shí)候會非常的慢，后面三次速度非?？觳⑶視r(shí)間相近，這里就有兩個(gè)問題出現(xiàn)了：

為什么第一次查詢速度這么慢？

count(*) 和 count(1) 到底存不存在性能差別？

查詢緩存

使用 explain 語句重新執(zhí)行查詢語句

explain (analyze,buffers,verbose) select count(*) from test_tbl;

可以看到如下輸出：

Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=882.569..882.570 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=96 read=476095
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=882.492..884.170 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=96 read=476095
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=881.014..881.014 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=96 read=476095
        Worker 0: actual time=880.319..880.319 rows=1 loops=1
         Buffers: shared hit=34 read=158206
        Worker 1: actual time=880.369..880.369 rows=1 loops=1
         Buffers: shared hit=29 read=156424
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=0.029..662.165 rows=3333333 loops=3)
           Buffers: shared hit=96 read=476095
           Worker 0: actual time=0.026..661.807 rows=3323029 loops=1
            Buffers: shared hit=34 read=158206
           Worker 1: actual time=0.030..660.197 rows=3285513 loops=1
            Buffers: shared hit=29 read=156424
 Planning time: 0.043 ms
 Execution time: 884.207 ms

注意里面的 shared hit，表示命中了內(nèi)存中緩存的數(shù)據(jù)，這就可以解釋為什么后面的查詢會比第一次快很多。接下來去掉緩存，并重啟 PostgreSQL。

service postgresql stop
echo 1 > /proc/sys/vm/drop_caches
service postgresql start

重新執(zhí)行 SQL 語句，速度慢了很多。

 Finalize Aggregate (cost=529273.69..529273.70 rows=1 width=8) (actual time=50604.564..50604.564 rows=1 loops=1)
  Output: count(*)
  Buffers: shared read=476191
  -> Gather (cost=529273.48..529273.69 rows=2 width=8) (actual time=50604.508..50606.141 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared read=476191
     -> Partial Aggregate (cost=528273.48..528273.49 rows=1 width=8) (actual time=50591.550..50591.551 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared read=476191
        Worker 0: actual time=50585.182..50585.182 rows=1 loops=1
         Buffers: shared read=158122
        Worker 1: actual time=50585.181..50585.181 rows=1 loops=1
         Buffers: shared read=161123
        -> Parallel Seq Scan on public.test_tbl (cost=0.00..517856.98 rows=4166598 width=0) (actual time=92.491..50369.691 rows=3333333 loops=3)
           Buffers: shared read=476191
           Worker 0: actual time=122.170..50362.271 rows=3320562 loops=1
            Buffers: shared read=158122
           Worker 1: actual time=14.020..50359.733 rows=3383583 loops=1
            Buffers: shared read=161123
 Planning time: 11.537 ms
 Execution time: 50606.215 ms

shared read 表示沒有命中緩存，通過這個(gè)現(xiàn)象可以推斷出，上一小節(jié)的四次查詢中，第一次查詢沒有命中緩存，剩下三次查詢都命中了緩存。

count(1) 和 count(*) 的區(qū)別

接下來探究 count(1) 和 count(*) 的區(qū)別是什么，繼續(xù)思考最開始的四次查詢，第一次查詢使用了 count(*)，第二次查詢使用了 count(1) ，卻依然命中了緩存，不正是說明 count(1) 和 count(*) 是一樣的嗎？

事實(shí)上，PostgreSQL 官方對于 is there a difference performance-wise between select count(1) and select count(*)? 問題的回復(fù)也證實(shí)了這一點(diǎn)：

Nope. In fact, the latter is converted to the former during parsing.[2]

既然 count(1) 在性能上沒有比 count(*) 更好，那么使用 count(*) 就是更好的選擇。

sequence scan 和 index scan

接下來測試一下，在不同數(shù)據(jù)量大小的情況下 count(*) 的速度，將查詢語句寫在 count.sql 文件中，使用 pgbench 進(jìn)行測試。

pgbench -c 5 -t 20 performance_test -r -f count.sql

分別測試 200w - 1000w 數(shù)據(jù)量下的 count 語句耗時(shí)

數(shù)據(jù)大小	count耗時(shí)(ms)
200w	738.758
300w	1035.846
400w	1426.183
500w	1799.866
600w	2117.247
700w	2514.691
800w	2526.441
900w	2568.240
1000w	2650.434

繪制成耗時(shí)曲線

曲線的趨勢在 600w - 700w 數(shù)據(jù)量之間出現(xiàn)了轉(zhuǎn)折，200w - 600w 是線性增長，600w 之后 count 的耗時(shí)就基本相同了。使用 explain 語句分別查看 600w 和 700w 數(shù)據(jù)時(shí)的 count 語句執(zhí)行。

700w：

Finalize Aggregate (cost=502185.93..502185.94 rows=1 width=8) (actual time=894.361..894.361 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=16344 read=352463
  -> Gather (cost=502185.72..502185.93 rows=2 width=8) (actual time=894.232..899.763 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=16344 read=352463
     -> Partial Aggregate (cost=501185.72..501185.73 rows=1 width=8) (actual time=889.371..889.371 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=16344 read=352463
        Worker 0: actual time=887.112..887.112 rows=1 loops=1
         Buffers: shared hit=5459 read=118070
        Worker 1: actual time=887.120..887.120 rows=1 loops=1
         Buffers: shared hit=5601 read=117051
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..493863.32 rows=2928960 width=0) (actual time=0.112..736.376 rows=2333333 loops=3)
           Index Cond: (test_tbl.id  7000000)
           Heap Fetches: 2328492
           Buffers: shared hit=16344 read=352463
           Worker 0: actual time=0.107..737.180 rows=2344479 loops=1
            Buffers: shared hit=5459 read=118070
           Worker 1: actual time=0.133..737.960 rows=2327028 loops=1
            Buffers: shared hit=5601 read=117051
 Planning time: 0.165 ms
 Execution time: 899.857 ms

600w：

Finalize Aggregate (cost=429990.94..429990.95 rows=1 width=8) (actual time=765.575..765.575 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=13999 read=302112
  -> Gather (cost=429990.72..429990.93 rows=2 width=8) (actual time=765.557..770.889 rows=3 loops=1)
     Output: (PARTIAL count(*))
     Workers Planned: 2
     Workers Launched: 2
     Buffers: shared hit=13999 read=302112
     -> Partial Aggregate (cost=428990.72..428990.73 rows=1 width=8) (actual time=763.821..763.821 rows=1 loops=3)
        Output: PARTIAL count(*)
        Buffers: shared hit=13999 read=302112
        Worker 0: actual time=762.742..762.742 rows=1 loops=1
         Buffers: shared hit=4638 read=98875
        Worker 1: actual time=763.308..763.308 rows=1 loops=1
         Buffers: shared hit=4696 read=101570
        -> Parallel Index Only Scan using test_tbl_pkey on public.test_tbl (cost=0.43..422723.16 rows=2507026 width=0) (actual time=0.053..632.199 rows=2000000 loops=3)
           Index Cond: (test_tbl.id  6000000)
           Heap Fetches: 2018490
           Buffers: shared hit=13999 read=302112
           Worker 0: actual time=0.059..633.156 rows=1964483 loops=1
            Buffers: shared hit=4638 read=98875
           Worker 1: actual time=0.038..634.271 rows=2017026 loops=1
            Buffers: shared hit=4696 read=101570
 Planning time: 0.055 ms
 Execution time: 770.921 ms

根據(jù)以上現(xiàn)象推斷，PostgreSQL 似乎在 count 的數(shù)據(jù)量小于數(shù)據(jù)表長度的某一比例時(shí)，才使用 index scan，通過查看官方 wiki 也可以看到相關(guān)描述：

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.[3]

根據(jù) Stackoverflow 上的回答，count 語句查詢的數(shù)量大于表大小的 3/4 時(shí)候就會用使用全表掃描代替索引掃描[4]。

結(jié)論

不要用 count(1) 或 count(列名) 代替 count(*)

count 本身是非常耗時(shí)的

count 可能是 index scan 也可能是 sequence scan，取決于 count 數(shù)量占表大小的比例

以上為個(gè)人經(jīng)驗(yàn)，希望能給大家一個(gè)參考，也希望大家多多支持腳本之家。如有錯(cuò)誤或未考慮完全的地方，望不吝賜教。

您可能感興趣的文章:

在postgresql數(shù)據(jù)庫中判斷是否是數(shù)字和日期時(shí)間格式函數(shù)操作
PostgreSQL 實(shí)現(xiàn)將多行合并轉(zhuǎn)為列
postgresql 實(shí)現(xiàn)sql多行語句合并一行
Postgresql自定義函數(shù)詳解
PostgreSQL刪除更新優(yōu)化操作
Postgresql排序與limit組合場景性能極限優(yōu)化詳解
postgresql通過索引優(yōu)化查詢速度操作
postgresql rank() over, dense_rank(), row_number()用法區(qū)別

標(biāo)簽：崇左銅川黃山湖南仙桃蘭州湘潭衡水

巨人網(wǎng)絡(luò)通訊聲明：本文標(biāo)題《postgreSQL 非count方法算記錄數(shù)操作》，本文關(guān)鍵詞；如發(fā)現(xiàn)本文內(nèi)容存在版權(quán)問題，煩請?zhí)峁┫嚓P(guān)信息告之我們，我們將及時(shí)溝通與處理。本站內(nèi)容系統(tǒng)采集于網(wǎng)絡(luò)，涉及言論、版權(quán)與本站無關(guān)。