
Help browser: add an index cache to speed up launch times.

Merged Albert Gräf requested to merge aggraef/purr-data:search-index-cache into master

This is another one of my pet peeves, so it would be nice if it could still make it into 2.14.1. This MR adds a cache for the search index, which considerably speeds up launching the help browser. Tested on Linux and Windows so far, where I get speedups in the 3-4x ballpark, depending on the OS and the number of indexed help patches (the more the better).

Features:

  • Rebuilds and caches the search index if there is no cache yet, or if the cache is outdated (i.e., any cached directory or its parent was removed or modified since the cache was last created).

  • Keeps track of the modification times of all cached directories of help files, as well as their parent directories, in order to determine when the cache is outdated. The parents are tracked to catch changes where new sibling directories are added. This may produce some false positives and still isn't 100% foolproof, but it is as close as we can get if we still want to achieve substantial speedups. (A sketch of this check follows the list.)

  • Automatically rebuilds the index (and cache) from scratch as soon as the help browser is relaunched after changes to the browser configuration in the gui prefs. So there's no longer any need to relaunch Purr Data for these changes to take effect.
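
To make the staleness check concrete, here is a minimal sketch of the idea in JavaScript (simplified, with made-up names; the actual code lives in pdgui.js). It assumes a stamps object mapping each tracked directory (including parents) to the mtime recorded when the cache was written:

    const fs = require("fs");

    function cache_is_stale(stamps) {
        for (const dir in stamps) {
            let mtime;
            try {
                mtime = fs.statSync(dir).mtimeMs;
            } catch (e) {
                return true; // directory was removed -> rebuild
            }
            if (mtime !== stamps[dir])
                return true; // directory (or a parent) was modified -> rebuild
        }
        return false; // all timestamps match, the cache is still valid
    }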

Both the index cache and the directory timestamps are maintained as ordinary text files with a straightforward syntax (basically colon-delimited csv) in the user's home directory. The files are located in the configuration directories (.purr-data on Linux and Mac, AppData/Roaming/Purr-Data on Windows), so it's easy to modify them with external tools (e.g., if we want to upgrade the file format in the future).
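
For illustration only (made-up entries; the exact field layout of the actual files may differ), an index cache record and a timestamp record might look like this, one colon-delimited record per line:

    /home/user/purr-data/doc/5.reference:list-help.pd:list object help
    /home/user/purr-data/doc/5.reference:1587481200000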

  • Why did you choose your own textfile format? Doesn't elasticlunr have a toJSON method and a load method to handle saving the index/documentStore?

  • Author Developer

    Maybe it has, but I don't know elasticlunr very well, it's a magical box for me. ;-)

    It's not my own format, it's just (colon-separated) csv. What's wrong with that? I like to keep things simple, and csv is easy and fast to generate and parse; it works well with the dead-simple data that we have here, and it doesn't need an elaborate framework. Also, it allows me to have both the index and the timestamps in the same format.

  • Maybe it has, but I don't know elasticlunr very well, it's a magical box for me. ;-)

    Well, it's the thing that handles the index you are caching so it can't be completely opaque. :)

    It's not my own format, it's just (colon-separated) csv. What's wrong with that?

    But how are you getting the index from internal representation to the csv?

    It just seems like it would be way more bulletproof to just use the pre-existing methods that serialize and deserialize the data using JSON. But perhaps I'm misunderstanding what exactly you're doing with the csv.

  • Author Developer

    Well, it's the thing that handles the index you are caching so it can't be completely opaque. :)

    For me it's enough that it does the job of quickly finding matches in the database. ;-)

    But how are you getting the index from internal representation to the csv?

    Quite simple: they're generated alongside each other when the index is built. You can see that in the bit added at pdgui.js:153 in the diffs, right before the call to index.addDoc. That doesn't noticeably slow down the index build, and when the cache needs to be rebuilt we have to build the index from scratch anyway, so building the two in concert works just fine.
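
    Schematically, the idea is something like this (a simplified sketch with made-up names, not the actual pdgui.js code):

        var cache_lines = [];
        function add_doc(doc) {
            // record the doc's fields as one colon-delimited cache line
            // (assuming the fields contain no unescaped colons)...
            cache_lines.push([doc.id, doc.title, doc.keywords].join(":"));
            // ...right before feeding the same doc to elasticlunr
            index.addDoc(doc);
        }
        // when the scan is done, write the whole cache in one go
        fs.writeFileSync(cache_file, cache_lines.join("\n"));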

    It just seems like it would be way more bulletproof to just use the pre-existing methods that serialize and deserialize the data using JSON.

    That doesn't solve the problem of keeping track of the timestamps, and I doubt that my code has many more bugs (which don't exist, I've tested it, haha :) than a much more complicated piece of software. Moreover, I feel responsible for that code, I take some pride in it, so you can rest assured that I'll keep maintaining it and fix bugs quickly. I feel much less inclined to dive into elasticlunr's serialization code. And don't tell me that there aren't any bugs in there, you know there are, every sufficiently complex piece of software has them. ;-)

    Edited by Albert Gräf
  • Author Developer

    See, frameworks are nice, and it's one of JavaScript's great strengths that it has so many that we can use. But I've learned that simplicity trumps complexity every single time. If you can get away with it, that is. There's stuff that's just intrinsically complicated, and then you want (and need) a well-tested framework. This is not one of those cases, though, unless our doc data structure suddenly becomes fiendishly complicated, and I don't see that happening.

    Come on, Jonathan, this MR is something you want, too, I know that. ;-) Give me a little time so that I can test it some more on the Mac, and then IMHO we can be reasonably sure that this implementation works. We can always improve it (or make it more complicated) later...

  • But I've learned that simplicity trumps complexity every single time.

    Exactly-- one framework trumps two. :)

    Anyway, if you've got it working you've got it working.

    Question on Windows-- are you sure the time spent checking timestamps is significantly less than the time spent creating the index? I know any kind of filesystem calls are obnoxiously slow there.

    Another question-- how fast is it to load the cache using your current algo vs. simply loading the cache without checking the timestamps for each directory? If there's a substantial difference I have a suggestion...

  • Author Developer

    Question on Windows-- are you sure the time spent checking timestamps is significantly less than the time spent creating the index?

    Yes. To give you some concrete figures, in the VM installation I use, with the help browser settings at defaults, building the index from scratch takes about 1 sec, while loading the cache and timestamp files, checking the timestamps, and populating the index from cache takes a mere 0.2 secs. (Of course, all this depends a lot on the internal state of the Windows file system, which can be so excruciatingly slow at times, especially in the VM, that I wonder how someone could have implemented it so badly. But while the initial index build time can vary wildly, I found that the cache check and load times are pretty consistent.)

    This sounds like magic, until you consider that I'm only checking the timestamps of the directories (including parents of directories which are in the cache), not individual files. That's the main trick here. The total number of files is typically more than an order of magnitude higher than the number of directories to be checked (792 vs. 40 in the particular case with default browser settings on Windows).

    Another question-- how fast is it to load the cache using your current algo vs. simply loading the cache without checking the timestamps for each directory?

    Loading the cache is almost instantaneous (I'm using fs.readFileSync to slurp the entire file in one go), so the lion's share of those 0.2 secs is indeed spent checking the timestamps and constructing the index. But if you have a suggestion on how to speed that up even further, then please go ahead, I'm all ears. :)
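
    For reference, the slurping part is essentially just this (a simplified sketch with made-up names):

        var data = fs.readFileSync(cache_file, "utf8");
        data.split("\n").forEach(function (line) {
            var fields = line.split(":"); // one colon-delimited record per line
            // ... reconstruct the doc from fields and re-add it to the index ...
        });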

    Edited by Albert Gräf
  • Author Developer

    I should maybe add that I already did quite a bit of optimization, like reading the files in one go, and using carefully crafted regexes in generating and parsing the files. So I'd be surprised if we could gain another 2-3x speedup. But I'd be pleasantly surprised. ;-)
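
    For instance, a single precompiled regex can pick apart a record in one pass instead of repeated string operations (a made-up example; the actual patterns in pdgui.js are more involved):

        var line_re = /^([^:]*):([^:]*):(.*)$/; // compiled once, reused for every line
        var m = line_re.exec(line); // m[1], m[2], m[3] are the three fields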

  • Author Developer

    Works well for me on the Mac, too; it loads the full set (doc+extra+helppath) from cache in just 0.4 secs. Which is very good news for me, as I prefer to have everything indexed, but couldn't do it on the Mac because it was just too damn slow there previously.

  • Author Developer

    The CI runners all seem to be offline now; I guess they're being worked on? Anyway, my own "CI" is green and passes all tests, so feel free to merge whenever it's convenient. :)

  • Author Developer

    Also tested with Windows 10 on real iron (Lenovo X1 Yoga 2018) now. The initial scan (full set in this test) needed some 20 secs after a fresh install of Purr, but loading from the cache then took just around 0.9 secs (tested several times; it stays < 1 sec pretty consistently, with some outliers requiring a little over 1 sec). This is a bit slower than my 2012 MacBook, but still negligible compared to the previous situation. In any case, with this MR indexing just everything is a viable option on Mac and Windows now, which it wasn't previously.

    As expected, Linux blows those two out of the water. :) Initial scan on the same X1 Yoga (full set again) takes a mere 1.3 secs, and loads from cache in some 0.3 secs, which is hardly noticeable at all.

  • Author Developer

    So at this point further optimization is probably a game with rapidly diminishing returns, I'd say. Let's just go with this. Ready to be merged. :)

  • Sounds good.

  • Author Developer

    Jonathan, I'm really done with this, honestly. :) So what are you waiting for? CI to come back online?

  • Albert Gräf mentioned in merge request !551 (merged)

  • Author Developer

    CI is still offline, so I cancelled my pending pipelines now. Can we go ahead anyway?
