
Help browser: add an index cache to speed up launch times.

Merged Albert Gräf requested to merge aggraef/purr-data:search-index-cache into master

This is another one of my pet peeves, so it would be nice if it could still make it into 2.14.1. This MR adds a cache for the search index, which considerably speeds up launching the help browser. Tested on Linux and Windows so far, where I get speedups in the 3-4x ballpark, depending on the OS and the number of indexed help patches (the more the better).

Features:

  • Rebuilds and caches the search index if there is no cache yet, or if the cache is outdated (i.e., any cached directory or its parent was removed or modified since the cache was last created).

  • Keeps track of the modification times of all cached directories of help files, as well as their parent directories, in order to determine when the cache is outdated. The parents are tracked to catch changes where new sibling directories are added. This may produce some false positives and still isn't 100% foolproof, but it is as close as we can get if we still want to achieve substantial speedups. (A sketch of this check follows the list.)

  • Automatically rebuilds the index (and cache) from scratch as soon as the help browser is relaunched after changes to the browser configuration in the gui prefs. So there's no longer any need to relaunch Purr Data for these changes to take effect.
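
To make the staleness check concrete, here is a minimal sketch of the idea in JavaScript (simplified, with made-up names; the actual code lives in pdgui.js). It assumes a stamps object mapping each tracked directory (including parents) to the mtime recorded when the cache was written:

    const fs = require("fs");

    function cache_is_stale(stamps) {
        for (const dir in stamps) {
            let mtime;
            try {
                mtime = fs.statSync(dir).mtimeMs;
            } catch (e) {
                return true; // directory was removed -> rebuild
            }
            if (mtime !== stamps[dir])
                return true; // directory (or a parent) was modified -> rebuild
        }
        return false; // all timestamps match, the cache is still valid
    }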

Both the index cache and the directory timestamps are maintained as ordinary text files with a straightforward syntax (basically colon-delimited csv) in the user's home directory. The files are located in the configuration directories (.purr-data on Linux and Mac, AppData/Roaming/Purr-Data on Windows), so it's easy to modify them with external tools (e.g., if we want to upgrade the file format in the future).
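
For illustration only (made-up entries; the exact field layout of the actual files may differ), an index cache record and a timestamp record might look like this, one colon-delimited record per line:

    /home/user/purr-data/doc/5.reference:list-help.pd:list object help
    /home/user/purr-data/doc/5.reference:1587481200000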

  • Why did you choose your own textfile format? Doesn't elasticlunr have a toJSON method and a load method to handle saving the index/documentStore?

  • Author Developer

    Maybe it has, but I don't know elasticlunr very well, it's a magical box for me. ;-)

    It's not my own format, it's just (colon-separated) csv. What's wrong with that? I like to keep things simple, and csv is easy and fast to generate and parse; it works well with the dead-simple data that we have here, and it doesn't need an elaborate framework. Also, it allows me to have both the index and the timestamps in the same format.

  • Maybe it has, but I don't know elasticlunr very well, it's a magical box for me. ;-)

    Well, it's the thing that handles the index you are caching so it can't be completely opaque. :)

    It's not my own format, it's just (colon-separated) csv. What's wrong with that?

    But how are you getting the index from internal representation to the csv?

    It just seems like it would be way more bulletproof to just use the pre-existing methods that serialize and deserialize the data using JSON. But perhaps I'm misunderstanding what exactly you're doing with the csv.

  • Author Developer

    Well, it's the thing that handles the index you are caching so it can't be completely opaque. :)

    For me it's enough that it does the job of quickly finding matches in the database. ;-)

    But how are you getting the index from internal representation to the csv?

    Quite simple: they're generated alongside each other when the index is built. You can see that in the bit added at pdgui.js:153 in the diffs, right before the call to index.addDoc. That doesn't noticeably slow down the index build, and when the cache needs to be rebuilt we have to build the index from scratch anyway, so building the two in concert works just fine.
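
    Schematically, the idea is something like this (a simplified sketch with made-up names, not the actual pdgui.js code):

        var cache_lines = [];
        function add_doc(doc) {
            // record the doc's fields as one colon-delimited cache line
            // (assuming the fields contain no unescaped colons)...
            cache_lines.push([doc.id, doc.title, doc.keywords].join(":"));
            // ...right before feeding the same doc to elasticlunr
            index.addDoc(doc);
        }
        // when the scan is done, write the whole cache in one go
        fs.writeFileSync(cache_file, cache_lines.join("\n"));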

    It just seems like it would be way more bulletproof to just use the pre-existing methods that serialize and deserialize the data using JSON.

    That doesn't solve the problem of keeping track of the timestamps, and I doubt that my code has many more bugs (which don't exist, I've tested it, haha :) than a much more complicated piece of software. Moreover, I feel responsible for that code, I take some pride in it, so you can rest assured that I'll keep maintaining it and fix bugs quickly. I feel much less inclined to dive into elasticlunr's serialization code. And don't tell me that there aren't any bugs in there, you know there are, every sufficiently complex piece of software has them. ;-)

    Edited by Albert Gräf
  • Author Developer

    See, frameworks are nice, and it's one of JavaScript's great strengths that it has so many that we can use. But I've learned that simplicity trumps complexity every single time. If you can get away with it, that is. There's stuff that's just intrinsically complicated, and then you want (and need) a well-tested framework. This is not one of those cases, though, unless our doc data structure suddenly becomes fiendishly complicated, and I don't see that happening.

    Come on, Jonathan, this MR is something you want, too, I know that. ;-) Give me a little time so that I can test it some more on the Mac, and then IMHO we can be reasonably sure that this implementation works. We can always improve it (or make it more complicated) later...

  • But I've learned that simplicity trumps complexity every single time.

    Exactly-- one framework trumps two. :)

    Anyway, if you've got it working you've got it working.

    Question on Windows-- are you sure the time spent checking timestamps is significantly less than the time spent creating the index? I know any kind of filesystem calls are obnoxiously slow there.

    Another question-- how fast is it to load the cache using your current algo vs. simply loading the cache without checking the timestamps for each directory? If there's a substantial difference I have a suggestion...

  • Author Developer

    Question on Windows-- are you sure the time spent checking timestamps is significantly less than the time spent creating the index?

    Yes. To give you some concrete figures, in the VM installation I use, with the help browser settings at defaults, building the index from scratch takes about 1 sec, while loading the cache and timestamp files, checking the timestamps, and populating the index from cache takes a mere 0.2 secs. (Of course, all this depends a lot on the internal state of the Windows file system, which can be so excruciatingly slow at times, especially in the VM, that I wonder how someone could have implemented it so badly. But while the initial index build time can vary wildly, I found that the cache check and load times are pretty consistent.)

    This sounds like magic, until you consider that I'm only checking the timestamps of the directories (including parents of directories which are in the cache), not individual files. That's the main trick here. The total number of files is typically more than an order of magnitude higher than the number of directories to be checked (792 vs. 40 in the particular case with default browser settings on Windows).

    Another question-- how fast is it to load the cache using your current algo vs. simply loading the cache without checking the timestamps for each directory?

    Loading the cache is almost instantaneous (I'm using fs.readFileSync to slurp the entire file in one go), so the lion's share of those 0.2 secs is indeed spent checking the timestamps and constructing the index. But if you have a suggestion on how to speed that up even further, then please go ahead, I'm all ears. :)
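
    For reference, the slurping part is essentially just this (a simplified sketch with made-up names):

        var data = fs.readFileSync(cache_file, "utf8");
        data.split("\n").forEach(function (line) {
            var fields = line.split(":"); // one colon-delimited record per line
            // ... reconstruct the doc from fields and re-add it to the index ...
        });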

    Edited by Albert Gräf
  • Author Developer

    I should maybe add that I already did quite a bit of optimization, like reading the files in one go, and using carefully crafted regexes in generating and parsing the files. So I'd be surprised if we could gain another 2-3x speedup. But I'd be pleasantly surprised. ;-)
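
    For instance, a single precompiled regex can pick apart a record in one pass instead of repeated string operations (a made-up example; the actual patterns in pdgui.js are more involved):

        var line_re = /^([^:]*):([^:]*):(.*)$/; // compiled once, reused for every line
        var m = line_re.exec(line); // m[1], m[2], m[3] are the three fields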

  • Author Developer

    Works well for me on the Mac, too; it loads the full set (doc+extra+helppath) from cache in just 0.4 secs. Which is very good news for me, as I prefer to have everything indexed, but couldn't do it on the Mac because it was just too damn slow there previously.

  • Author Developer

    The CI runners all seem to be offline now; I guess they're being worked on? Anyway, my own "CI" is green and passes all tests, so feel free to merge whenever it's convenient. :)

  • Author Developer

    Also tested with Windows 10 on real iron (Lenovo X1 Yoga 2018) now. The initial scan (full set in this test) needed some 20 secs after a fresh install of Purr, but loading from the cache then took just around 0.9 secs (tested several times; it stays < 1 sec pretty consistently, with some outliers requiring a little over 1 sec). This is a bit slower than my 2012 MacBook, but still negligible compared to the previous situation. In any case, with this MR indexing just everything is a viable option on Mac and Windows now, which it wasn't previously.

    As expected, Linux blows those two out of the water. :) Initial scan on the same X1 Yoga (full set again) takes a mere 1.3 secs, and loads from cache in some 0.3 secs, which is hardly noticeable at all.

  • Author Developer

    So at this point further optimization is probably a game with rapidly diminishing returns, I'd say. Let's just go with this. Ready to be merged. :)

  • Sounds good.

  • Author Developer

    Jonathan, I'm really done with this, honestly. :) So what are you waiting for? CI to come back online?

  • Albert Gräf mentioned in merge request !551 (merged)

  • Author Developer

    CI is still offline, so I cancelled my pending pipelines now. Can we go ahead anyway?
