18 - Pagination and Traversal: From, Size, Search After & Scroll API
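The Deep Pagination Problem in Distributed Systems
● ES is distributed by design. A query has to collect data stored across multiple shards on multiple machines, and ES still has to return the results in sorted order (by relevance score by default).
● For a query with From = 990, Size = 10:
- Each shard first fetches its own top 1000 documents. The Coordinating Node then gathers the results from all shards, sorts them, and selects the global top 1000 documents, of which only the last 10 are returned.
- The deeper the page, the more memory is used. To limit the memory overhead of deep pagination, ES caps the result window at 10,000 documents by default (the index.max_result_window setting).
● From + Size must not exceed 10000.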
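The cost described above can be illustrated with a small simulation. This is only a toy sketch in plain Python, not how ES is implemented: the shard contents, the `score` field, and the helper names `shard_top_k` and `paged_search` are all assumptions made for illustration.

```python
import heapq
import random


def shard_top_k(shard_docs, k):
    """Each shard returns its own top-k documents by score (descending)."""
    return heapq.nlargest(k, shard_docs, key=lambda d: d["score"])


def paged_search(shards, from_, size):
    """Simulate a coordinating node serving a from/size request."""
    k = from_ + size
    # Every shard must contribute from + size candidates, not just `size`.
    candidates = []
    for shard in shards:
        candidates.extend(shard_top_k(shard, k))
    # The coordinating node sorts all candidates and keeps the global top k,
    merged = heapq.nlargest(k, candidates, key=lambda d: d["score"])
    # then skips the first `from_` hits and returns only the requested page.
    return merged[from_:from_ + size], len(candidates)


random.seed(0)
# Hypothetical index: 5 shards with 5000 scored documents each.
shards = [
    [{"id": f"s{s}-d{i}", "score": random.random()} for i in range(5000)]
    for s in range(5)
]
page, held = paged_search(shards, from_=990, size=10)
print(len(page), held)  # 10 hits returned, 5000 candidates held in memory
```

With 5 shards, returning 10 documents requires the coordinating node to hold 5 × 1000 candidates, and that number grows linearly with From.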
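Search After
● Avoids the performance problems of deep pagination and retrieves the next page of documents in real time
- Does not support jumping to a specific page (From)
- Can only page forward
● The first search must specify a sort whose values are unique for every document (adding _id as a tiebreaker guarantees uniqueness)
● Each subsequent query passes the sort values of the last document from the previous response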
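The cursor logic above can be sketched in plain Python. This is a toy model over an in-memory list, assuming a sort of age descending with _id ascending as the tiebreaker; the function `search_after_page` and the short `_id` values are hypothetical and only mimic the idea, not the ES implementation.

```python
def search_after_page(docs, size, search_after=None):
    """Return one page sorted by (age desc, _id asc), starting after a cursor."""
    # Negate age for descending order; _id ascending breaks ties.
    def key(d):
        return (-d["age"], d["_id"])

    ordered = sorted(docs, key=key)
    if search_after is not None:
        age, doc_id = search_after
        cursor = (-age, doc_id)
        # Keep only documents that sort strictly after the cursor.
        ordered = [d for d in ordered if key(d) > cursor]
    page = ordered[:size]
    # The sort values of the last hit become the cursor for the next request.
    next_cursor = [page[-1]["age"], page[-1]["_id"]] if page else None
    return page, next_cursor


docs = [
    {"_id": "a", "name": "user1", "age": 10},
    {"_id": "b", "name": "user2", "age": 11},
    {"_id": "c", "name": "user2", "age": 12},
    {"_id": "d", "name": "user2", "age": 13},
]

cursor = None
seen = []
while True:
    page, cursor = search_after_page(docs, size=1, search_after=cursor)
    if not page:
        break
    seen.extend(d["age"] for d in page)
print(seen)  # [13, 12, 11, 10]
```

Because each request only needs the documents after the cursor, the work per page does not grow with how far the client has already paged. The unique tiebreaker matters: without it, documents that share the same sort value as the cursor could be skipped.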
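Scroll API
● Creates a snapshot; data written after the snapshot is created cannot be found through it
● After each query, pass the Scroll Id returned by the previous response to fetch the next batch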
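The snapshot behavior can be mimicked with a toy in-memory class. This is a sketch under stated assumptions, not the ES API: the class `ScrollIndex` and its methods are hypothetical, and the snapshot is modeled as a simple copy of the document list taken when the scroll is opened.

```python
import uuid


class ScrollIndex:
    """Toy in-memory index that mimics scroll's snapshot semantics."""

    def __init__(self):
        self.docs = []
        self._scrolls = {}  # scroll_id -> snapshot, position, batch size

    def index(self, doc):
        self.docs.append(doc)

    def search_with_scroll(self, size):
        # Freeze the current documents; later writes do not affect this copy.
        scroll_id = uuid.uuid4().hex
        self._scrolls[scroll_id] = {
            "snapshot": list(self.docs),
            "pos": 0,
            "size": size,
        }
        return scroll_id, self.scroll(scroll_id)

    def scroll(self, scroll_id):
        # Return the next batch from the frozen snapshot and advance the position.
        ctx = self._scrolls[scroll_id]
        batch = ctx["snapshot"][ctx["pos"]:ctx["pos"] + ctx["size"]]
        ctx["pos"] += len(batch)
        return batch


idx = ScrollIndex()
for name, age in [("user1", 10), ("user2", 20), ("user3", 30), ("user4", 40)]:
    idx.index({"name": name, "age": age})

scroll_id, batch = idx.search_with_scroll(size=1)
idx.index({"name": "user5", "age": 50})  # written after the snapshot was taken

names = [d["name"] for d in batch]
while True:
    batch = idx.scroll(scroll_id)
    if not batch:
        break
    names.extend(d["name"] for d in batch)
print(names)  # ['user1', 'user2', 'user3', 'user4']
```

The document written after the scroll was opened exists in the index but never appears in the scrolled results, which is why scroll suits full exports rather than real-time paging.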
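Search After demo
# Deep pagination: from + size is 10001, which exceeds the default limit of 10000, so ES rejects this request
POST tmdb/_search
{
  "from": 10000,
  "size": 1,
  "query": {
    "match_all": {}
  }
}

# Search After
DELETE users

POST users/_doc
{"name":"user1","age":10}

POST users/_doc
{"name":"user2","age":11}

POST users/_doc
{"name":"user2","age":12}

POST users/_doc
{"name":"user2","age":13}

POST users/_count

# First page: a sort is required, with _id added so the sort values are unique
POST users/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  "sort": [
    {"age": "desc"},
    {"_id": "asc"}
  ]
}

# Next page: search_after takes the sort values of the last hit in the previous response
# (the values below come from the original run; replace them with the ones your cluster returns)
POST users/_search
{
  "size": 1,
  "query": {
    "match_all": {}
  },
  "search_after": [
    10,
    "ZQ0vYGsBrR8X3IP75QqX"
  ],
  "sort": [
    {"age": "desc"},
    {"_id": "asc"}
  ]
}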
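Scroll API demo
# Scroll API
DELETE users

POST users/_doc
{"name":"user1","age":10}

POST users/_doc
{"name":"user2","age":20}

POST users/_doc
{"name":"user3","age":30}

POST users/_doc
{"name":"user4","age":40}

# Open a scroll context (snapshot) and keep it alive for 5 minutes
POST /users/_search?scroll=5m
{
  "size": 1,
  "query": {
    "match_all": {}
  }
}

# Indexed after the snapshot was created, so the scroll will not return it
POST users/_doc
{"name":"user5","age":50}

# Fetch the next batch with the scroll_id from the previous response
# (the value below comes from the original run; replace it with the one your cluster returns)
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAWAWbWdoQXR2d3ZUd2kzSThwVTh4bVE0QQ=="
}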