This is a proposal for adding a callback that runs whenever a record is updated on Fairdata and indexes the updated record(s) in an associated Typesense index. This would have several advantages, including:
- More or less instant updates rather than waiting for a scheduled cron job;
- Would run on the main worker dyno rather than needing to spin up new dynos each time a job is run and worry about what size is needed for each project; with this system all updates would only involve a handful of records at once, so resources shouldn't be a problem;
- Parameters for the Typesense config are managed within the Fairdata database rather than somewhere outside it;
- No need to pull all the model data and send it to Typesense with every update; only data that actually changes is sent.
- Could potentially simplify the process of configuring searches for a core-data-places site, which currently requires a full Typesense config for each search; at least some of that info could now probably be grabbed from Fairdata, or at least this opens the possibility for that.
The Steps
- Add typesense_host and typesense_api fields at the project level, similar to how we've added faircopy_url etc.
- This will involve a schema migration to add the fields, as well as an update to the project configuration UI form to allow the values to be entered and saved. Basically I would copy exactly how the faircopy_url field is implemented.
- Could be one JSON object rather than two string fields if that seems better.
- I'm imagining long term that Typesense will be one of several options for what to do with the search JSON generated by the after_save callback.
- Add a typesense_index field at the model level.
- This could potentially also be a JSON field; in either case there are a couple of flags that we'd also want the ability to control per-model, namely a polygons flag and some sort of index_on_save flag that can be turned off if for some reason they want to make changes without updating the search index. (Do we want to allow that? I guess just deleting the index name would accomplish the same thing? Maybe we don't want/need to allow it... but we do need the polygons flag option.)
- Add after_save callbacks to all model types (or whatever exactly the right callback would be) that do the following:
- The easy(?) part: The logic for creating search JSON from a given record already exists. After a record is updated, we can mimic the code from here to update the record in the associated Typesense index, using a Typesense client initialized with the parameters that come from the project and specific model (see here).
- If the record was deleted, then instead we can call the delete method on the client and filter by uuid. See here.
- The hard part: Relationships. This is the part that's making my head hurt a bit to think through, possibly because Rails stuff is still kind of sorcery to me. If record A is edited to have a new relationship with record B, then obviously the Typesense callback needs to happen for both items A and B. That part I feel should be doable -- the callback for after a new relationship is saved can just send the data for both records involved. The more subtle case is if records A and B are related and some field of A is modified. This should trigger not only a reindex of A but also of B, since we index names and UDFs of related records. Perhaps that's not as annoying as I'm fearing, though; perhaps we just need the callback function to find all related records and then call the reindexing function on them. But we definitely need to think through all the possible cases -- a record related to record A was deleted; a relationship was deleted; etc.
- On the Typesense side, I think there are things we could do to facilitate this workflow, specifically making use of aliases. I propose that by default (barring some other specific need) we create a <model>_prod and a <model>_staging alias for each model we're indexing on a project. Then if the client desires a "frozen" version of the data for a production site or whatever, we can point the _prod alias at an index that we populate all at once and that doesn't get updated. Meanwhile the _staging alias can be pointed at another index (perhaps named <model>_<month>_<year> or something?) that is the one that's actually configured to be updated as the data is updated. Then when we do a data "push" to production, all that's needed is to change the _prod alias to point at the up-to-date index. The old frozen index can be deleted and a new "live" index created, with the _staging alias pointing to it. This way there's zero risk of downtime or errors on production as the data is updated.
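To make the callback and alias steps above a bit more concrete, here is a rough plain-Ruby sketch of the logic. It is not Rails code and it does not use the Typesense client: FakeSearch is an in-memory stand-in, and Record, search_doc, records_to_reindex, after_save, after_destroy and push_to_production are all hypothetical names made up for illustration. It also assumes that the new "live" index starts out as a copy of the current one, which the proposal doesn't actually specify.

```ruby
require "set"

# Hypothetical in-memory stand-in for the search backend (NOT the Typesense client API).
class FakeSearch
  attr_reader :indexes, :aliases

  def initialize
    @indexes = Hash.new { |hash, name| hash[name] = {} } # index name => { uuid => doc }
    @aliases = {}                                         # alias name => index name
  end

  def upsert(index, doc)
    @indexes[index][doc[:uuid]] = doc
  end

  def delete(index, uuid)
    @indexes[index].delete(uuid)
  end

  def point_alias(name, index)
    @aliases[name] = index
  end
end

# Hypothetical record: a uuid, a name, and the uuids of directly related records.
Record = Struct.new(:uuid, :name, :related_uuids)

# The search document embeds related names, which is why editing A also affects B.
def search_doc(record, store)
  related = record.related_uuids.filter_map { |uuid| store[uuid]&.name }
  { uuid: record.uuid, name: record.name, related_names: related }
end

# The changed record plus every record on either side of one of its relationships.
def records_to_reindex(changed, store)
  uuids = Set[changed.uuid] | changed.related_uuids
  store.each_value { |r| uuids << r.uuid if r.related_uuids.include?(changed.uuid) }
  uuids
end

# After a save: re-send the record and its neighbours.
def after_save(changed, store, search, index)
  records_to_reindex(changed, store).each do |uuid|
    record = store[uuid]
    search.upsert(index, search_doc(record, store)) if record
  end
end

# After a delete: drop the document by uuid, then refresh the neighbours,
# whose related_names no longer include the deleted record.
def after_destroy(deleted, store, search, index)
  store.delete(deleted.uuid)
  search.delete(index, deleted.uuid)
  (records_to_reindex(deleted, store) - [deleted.uuid]).each do |uuid|
    record = store[uuid]
    search.upsert(index, search_doc(record, store)) if record
  end
end

# Data "push": _prod moves to the up-to-date index, the old frozen index is
# dropped, and _staging moves to a fresh live index (assumed here to be a copy).
def push_to_production(search, model, new_live_index)
  prod = "#{model}_prod"
  staging = "#{model}_staging"
  old_frozen = search.aliases[prod]
  live = search.aliases[staging]
  search.point_alias(prod, live)
  search.indexes.delete(old_frozen) if old_frozen && old_frozen != live
  search.indexes[new_live_index] = search.indexes[live].dup
  search.point_alias(staging, new_live_index)
end
```

The point of records_to_reindex is that the "subtle case" (A changes, B embeds A's name) and the delete case both reduce to "find the neighbours and regenerate their documents". Whether that lookup is cheap enough in the real schema is exactly the part that still needs thinking through.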
Some further notes
- I think this would be a huge improvement over the current system; but I am not confident enough in Rails stuff to be certain exactly how big a lift it is. I think I have a basic scaffold in my head of how it should work but obviously I welcome comments/corrections/improvements.
- This outline does not include any solution for project-specific indexing scripts, such as indexing full text from an associated Faircopy project.
- It also does not include any solution for the problem that's come up a few times of desiring to be able to index a relation-of-relation model, e.g. a places search with a facet for an identity label taxonomy related to people. I don't feel sure whether that's something we want to add to the Rails scripts as an option somehow, or whether we want to continue to meet that need by running a project-specific node script that grabs the desired second-order data and adds it to the index (that's what I've so far done for both NBU and MiCA, but obviously it's a bit ad hoc).
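For anyone unsure what "relation-of-relation" means in practice, here is a tiny plain-Ruby illustration of the data shape involved. Everything in it is made up for the example: the lookup hashes, the identity_label_facet helper, the identity_labels field name and the label values are hypothetical, and this is neither the existing Rails logic nor the node scripts used for NBU and MiCA.

```ruby
# Collect the labels attached to the people related to a place, so the place's
# search document can carry them as a facet. Both lookup tables are hypothetical;
# in practice they would be derived from Fairdata relationships.
def identity_label_facet(place_uuid, people_by_place, labels_by_person)
  people_by_place.fetch(place_uuid, [])
                 .flat_map { |person_uuid| labels_by_person.fetch(person_uuid, []) }
                 .uniq
                 .sort
end

people_by_place = { "place-1" => ["person-1", "person-2"] }
labels_by_person = {
  "person-1" => ["merchant"],
  "person-2" => ["merchant", "clergy"]
}

place_doc = {
  uuid: "place-1",
  name: "Example Place",
  identity_labels: identity_label_facet("place-1", people_by_place, labels_by_person)
}
```

The catch for the callback approach is that a change to a person's labels would then also need to reindex every place related to that person, i.e. the reindex set grows by one more hop.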