
Partial indexes are amazing, but you have to keep some peculiarities in mind.

If your query's WHERE clause doesn't properly match the WHERE clause of the index, the index will not be used. It is easy to forget this or to get it wrong in subtle ways. Here is an example from work.

There was an event tracing structure which contained the event severity_id. Id values 0-6 inclusive are user-facing events; severity 7 and up are debug events. In practice all debug events were 7 and there were no other values above 7. This table had a partial index with WHERE severity_id < 7.

I tracked down a performance regression to an ORM that (due to programmer error) generated WHERE severity_id != 7. The database is obviously not able to tell that there will never be any values above 7, so the index was not used, slowing down event handling. Changing the query to < 7, so that it matches the index predicate, fixed the problem. The database might also be unable to infer that the index can indeed be used, for example when prepared statements are involved: with WHERE severity_id < ? the database cannot tell that every binding of ? will satisfy < 7, so it will not use the index (unless you are running PG 12, where that might depend on the setting of plan_cache_mode[1], but I have not tested that yet).
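The same predicate-matching pitfall is easy to reproduce in SQLite, which also supports partial indexes and is handy for a self-contained sketch (the schema mirrors the example above; the index name `user_events` is made up):

```python
import sqlite3

# Hypothetical schema mirroring the event-tracing table described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, severity_id INTEGER)")
conn.execute("CREATE INDEX user_events ON events(severity_id) WHERE severity_id < 7")

def plan(sql):
    # EXPLAIN QUERY PLAN shows whether the planner picks the partial index.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Predicate matches the index's WHERE clause: the index is used.
matching = plan("SELECT severity_id FROM events WHERE severity_id < 7")
# Equivalent on this data, but the planner can't prove it: full table scan.
mismatched = plan("SELECT severity_id FROM events WHERE severity_id != 7")
print(matching)    # mentions COVERING INDEX user_events
print(mismatched)  # a SCAN of the table
```

PostgreSQL's planner proves somewhat more implications than SQLite's (e.g. `< 5` implies `< 7`), but `!= 7` defeats both.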

Another thing is that HOT updates in PostgreSQL can't be performed if the updated column is indexed, and that includes appearing in the WHERE clause of a partial index. So you could have a site like HN and think it would be nice to index stories WHERE vote > 100 to quickly find the more popular ones. That index, however, would rule out HOT updates whenever the vote tally is updated. Again, not necessarily a problem, but you need to know the possible drawbacks.
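As a sketch (hypothetical table and index names, not tested here), the scenario looks like this; you can watch the `n_tup_hot_upd` counter in `pg_stat_user_tables` to see HOT updates stop happening once the partial index exists:

```sql
-- Hypothetical HN-like schema: this partial index makes every UPDATE of
-- "vote" non-HOT, because "vote" appears in the index's WHERE clause
-- even though it is not an indexed column.
CREATE TABLE stories (id bigint PRIMARY KEY, title text, vote int);
CREATE INDEX popular_stories ON stories (id) WHERE vote > 100;

UPDATE stories SET vote = vote + 1 WHERE id = 1;

-- Compare regular vs. HOT update counts before and after:
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'stories';
```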

That said, they are great when used for the right purpose. Kudos to the author for a nice article!

[1] - https://postgresqlco.nf/doc/en/param/plan_cache_mode/



> The database is obviously not able to tell that there will never be any values above 7

You say "obviously", but with updated statistics this is the exactly the kind of thing you might expect the planner to know and aid index decisions.

I'm a huge fan of Postgres, having come to it around 5 years ago after at least 10 years with SQL Server, but I have hit a few things like this in that time. IME the planner is much more fickle about how you specify your predicates than SQL Server's is.


No, I don't think statistics can let you get away with this. Databases are concurrent, you can't guarantee that a different session will not insert a record that invalidates your current statistics.

You could argue that it should be able to use the index if the table has a check constraint preventing any severity_id above 7 from ever being inserted. That is something that could be done; I don't know if PostgreSQL does it (I doubt it) or how feasible it would be.
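The constraint half of this is straightforward to demonstrate (here with SQLite as a stand-in; same hypothetical schema as above). Unlike statistics, a CHECK constraint is a hard guarantee that holds across concurrent sessions, which is what would make the inference safe; whether any planner actually exploits it for partial-index matching is the open question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint guarantees no row can ever
# have severity_id above 7, no matter which session inserts it.
conn.execute(
    "CREATE TABLE events ("
    "  id INTEGER PRIMARY KEY,"
    "  severity_id INTEGER CHECK (severity_id <= 7)"
    ")"
)
conn.execute("INSERT INTO events (severity_id) VALUES (7)")  # allowed
try:
    conn.execute("INSERT INTO events (severity_id) VALUES (8)")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True  # the constraint rejects the row outright
print(blocked)
```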

Is SQL Server able to make an assumption like that purely based on statistics? Genuine question.


> No, I don't think statistics can let you get away with this. Databases are concurrent, you can't guarantee that a different session will not insert a record that invalidates your current statistics.

Of course you can't make a guarantee like that, but why would you need to? Statistics are there to guide planner choices, not make cast iron predictions.


Let us say that in our example table we have 100,000 records with severity_id < 7, 200,000 with severity_id = 7, and 3 records with severity_id = 8.

Statistics claim 100k with id < 7, 200k with id = 7, and 0 with id > 7. The last 3 inserts could have happened right before our query, before the statistics were updated.

Let us assume that we blindly trust the statistics, which currently state that there are absolutely no values with severity_id > 7, and you have a query WHERE severity_id != 7 and a partial index on severity_id < 7.

If you trust the statistics and actually use the index, the rows containing severity_id = 8 will never be returned by the query even though they exist. By using the index you only scan 100k rows and never touch the remaining ~200k. However, this query can't be answered correctly without scanning all ~300k records. This means that on the same database you would get two different results for the exact same query if you decided to drop the index after the first run. The database can't fall back and change the plan during execution.
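The argument can be made concrete with a toy model, scaled down 1000x, where the partial "index" is just the precomputed subset of rows it covers:

```python
# Toy model of the example above, scaled down 1000x: the partial "index"
# is modeled as the precomputed subset of rows it covers.
rows = [3] * 100 + [7] * 200 + [8] * 3          # severity_id values
partial_index = [s for s in rows if s < 7]      # WHERE severity_id < 7

# Query: WHERE severity_id != 7
answered_via_index = [s for s in partial_index if s != 7]  # unsound plan
answered_via_scan = [s for s in rows if s != 7]            # correct plan

print(len(answered_via_index))  # 100 -- the three severity 8 rows are lost
print(len(answered_via_scan))   # 103
```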

Perhaps I misunderstood you originally. I thought you were suggesting that the database should be able to know it can still use the index because the statistics currently claim there are no records that would make the result incorrect. You are of course correct that statistics are there to guide planner choices, and that is how they are used within PostgreSQL; however, some plans would give different results if the assumptions about the data were wrong.


Because the database has to decide whether or not to use the index. If it decides to use the index and there are values above 7, then it will misbehave (the query will miss those rows). Now of course the database could then scan for the values above 7 that it missed, but at that point there's no point in using the index and you might as well have done a full scan for the original query.

As a result, the database has to be 100% sure that there are no values _at all_ above 7 to safely and efficiently use the index, e.g. when there's a constraint.


I doubt it? At least the number of times the "last updated" column appears in SQL Server stats [1] leads me to believe it collects stats asynchronously with updates to the table.

The only system I've heard of that relies on up-to-date statistics for correctness is Snowflake (long but interesting talk here [2]), where having an accurate max/min for each micro-partition is really helpful for cutting down the amount of data read in the large range-scan queries common in BI. I'd guess that being a BI system, Snowflake can get away with higher row-update latency too.

[1] https://www.sqlshack.com/sql-server-statistics-and-how-to-pe...

[2] https://www.youtube.com/watch?v=CPWn1SZUZqE


All statistics in Postgres are considered best-effort guidance. Even if the statistics are wrong, they can never impact the correctness of the results.


In your example, the WHERE clause of the query and that of the partial index didn't match logically, i.e. the query may return rows that are not indexed. There's nothing Postgres can do, and I wouldn't classify the behavior as peculiar.

On the other hand, with sqlite, the WHERE clause of the query and the partial index must match *literally*. So let's say you have a partial index with WHERE severity_id != 0, and a query with WHERE severity_id = 1. All the rows with severity_id = 1 are already indexed, but the engine is still not able to make use of the partial index. This one bit us hard.
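That case is easy to sketch with `EXPLAIN QUERY PLAN` (the index name `nonzero` is made up; exact behavior may vary across SQLite versions, since the planner's implication proofs are deliberately limited to a few simple forms):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, severity_id INTEGER)")
conn.execute("CREATE INDEX nonzero ON events(severity_id) WHERE severity_id != 0")

def plan(sql):
    return " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# severity_id = 1 logically implies severity_id != 0, but SQLite wants a
# (near-)literal match with the index predicate, so the index is skipped.
skipped = plan("SELECT severity_id FROM events WHERE severity_id = 1")
# Repeating the index predicate verbatim in the query makes it usable.
used = plan(
    "SELECT severity_id FROM events WHERE severity_id != 0 AND severity_id = 1"
)
print(skipped)  # table scan
print(used)     # mentions index nonzero
```

The workaround shown in the second query (AND-ing the index's own predicate into the WHERE clause) is clunky but reliable.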



