Add search indexing to native Fairdata workflow as callback

This is a proposal for adding a callback that runs whenever a record is updated in Fairdata and indexes the updated record(s) in an associated Typesense index. This would have several advantages:
- Updates would be more or less instant, rather than waiting for a scheduled cron job.
- Indexing would run on the main worker dyno, so there'd be no need to spin up new dynos each time a job runs or to worry about what dyno size each project needs. With this system each update would only involve a handful of records at once, so resources shouldn't be a problem.
- Parameters for the Typesense config would be managed within the Fairdata database rather than outside of it.
- No need to pull all the model data and send it to Typesense with every update; only data that actually changes is sent.
- It could potentially simplify configuring searches for a `core-data-places` site, which currently requires a full Typesense config for each search. At least some of that info could probably be pulled from Fairdata instead, or at least this opens up that possibility.

### The Steps
1. Add `typesense_host` and `typesense_api` fields at the project level, similar to how we've added `faircopy_url` etc.
    - This will involve a schema migration to add the fields, as well as an update to the project configuration UI form to allow the values to be entered and saved. Basically I would copy exactly how the `faircopy_url` field is implemented.
    - Could be one JSON object rather than two string fields if that seems better.
    - I'm imagining long term that Typesense will be one of several options for what to do with the search JSON generated by the `after_save` callback.
2. Add a `typesense_index` field at the model level.
    - This could potentially also be a JSON field. Either way, there are a couple of flags we'd also want to control per model: a `polygons` flag, and possibly some sort of `index_on_save` flag that can be turned off if someone wants to make changes without updating the search index. (Do we want to allow that? Deleting the index name would presumably accomplish the same thing, so maybe we don't need it. We do need the `polygons` flag, though.)
3. Add `after_save` callbacks (or whatever exactly the right callback turns out to be) to all model types that do the following:
    - The easy(?) part: The logic for creating search JSON from a given record [already exists](https://github.com/performant-software/core-data-connector/blob/f10cb17b4bf53c0389c6d3d23e25eb6173c00759/app/services/core_data_connector/search/base.rb#L316). After a record is updated, we can mimic the code from [here](https://github.com/performant-software/core-data-connector/blob/f10cb17b4bf53c0389c6d3d23e25eb6173c00759/lib/typesense/search.rb#L37) to update the record in the associated Typesense index, using a Typesense client initialized with the parameters that come from the project and specific model (see [here](https://github.com/performant-software/core-data-connector/blob/f10cb17b4bf53c0389c6d3d23e25eb6173c00759/lib/typesense/base.rb#L7)).
    - If the record was deleted, then instead we can call the `delete` method on the client and filter by `uuid`. See [here](https://github.com/performant-software/core-data-connector/blob/f10cb17b4bf53c0389c6d3d23e25eb6173c00759/lib/typesense/search.rb#L42). Note that `after_save` doesn't fire on destroy, so this case needs its own callback (e.g. `after_destroy` or `after_commit`).
    - The hard part: relationships. This is the part that's making my head hurt a bit to think through, possibly because Rails stuff is still kind of sorcery to me. If record A is edited to have a new relationship with record B, then the Typesense callback needs to happen for both A and B. That part should be doable: the callback that runs after a new relationship is saved can just send the data for both records involved. The more subtle case is when A and B are already related and some field of A is modified. This should trigger a reindex not only of A but also of B, since we index the names and UDFs of related records. Perhaps that's not as annoying as I'm fearing; perhaps the callback just needs to find all related records and call the reindexing function on them. But we definitely need to think through all the possible cases: a record related to A was deleted, a relationship was deleted, etc.
4. On the Typesense side, I think we could facilitate this workflow by making use of [aliases](https://typesense.org/docs/30.1/api/collection-alias.html#collection-alias). I propose that by default (barring some other specific need) we create a `<model>_prod` and a `<model>_staging` alias for each model we're indexing on a project. If the client wants a "frozen" version of the data for a production site or whatever, we can point the `_prod` alias at an index that we populate all at once and that doesn't get updated afterwards. Meanwhile the `_staging` alias can be pointed at another index (perhaps named `<model>_<month>_<year>` or something) that is the one actually configured to be updated as the data changes. Then when we do a data "push" to production, all that's needed is to repoint the `_prod` alias at the up-to-date index. The old frozen index can then be deleted and a new "live" index created, with the `_staging` alias pointing to it. This way production shouldn't see downtime or errors while the data is being updated.
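
To make the relationship cases in step 3 a bit more concrete, here is a minimal plain-Ruby sketch of deciding which documents need to be re-sent or deleted for a given change. Everything in it is hypothetical and for illustration only: the `Relationship` struct, the event hashes, and the `reindex_plan` name are not taken from `core-data-connector`, and a real version would query the actual relationship records rather than an in-memory array.

```ruby
require 'set'

# Hypothetical stand-in for a relationship row linking two records by UUID.
Relationship = Struct.new(:record_uuid, :related_record_uuid)

# Returns the UUIDs directly related to `uuid`, in either direction,
# excluding `uuid` itself.
def related_uuids(uuid, relationships)
  found = relationships.each_with_object(Set.new) do |rel, acc|
    acc << rel.related_record_uuid if rel.record_uuid == uuid
    acc << rel.record_uuid if rel.related_record_uuid == uuid
  end
  found.delete(uuid)
end

# Decides which documents to upsert and which to delete for one change event.
# For a deleted record, `relationships` is assumed to be the state from before
# the delete, so that the former neighbours can still be found.
def reindex_plan(event, relationships)
  case event[:type]
  when :record_saved
    # A changed field on A also affects every record that indexes A's name/UDFs.
    { upsert: Set[event[:uuid]] | related_uuids(event[:uuid], relationships),
      delete: Set.new }
  when :record_deleted
    { upsert: related_uuids(event[:uuid], relationships),
      delete: Set[event[:uuid]] }
  when :relationship_saved, :relationship_deleted
    # Both ends of the relationship need fresh documents.
    { upsert: Set[event[:record_uuid], event[:related_record_uuid]],
      delete: Set.new }
  else
    raise ArgumentError, "unknown event type: #{event[:type].inspect}"
  end
end
```

The idea is that each callback only has to build an event and hand the resulting UUID sets to the existing search-JSON and Typesense code. Whether one hop of related records is enough depends on what the search JSON actually includes.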
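
The alias "push" in step 4 can be sketched as a small pure function. This is only an illustration under stated assumptions: `aliases` is an in-memory hash standing in for the Typesense alias table, the `promote` and `alias_names` helpers are hypothetical, and the collection names in the usage example are made up. A real version would apply the same changes through the Typesense client, after the new live collection has been created.

```ruby
# Hypothetical helper following the `<model>_prod` / `<model>_staging` convention.
def alias_names(model)
  { prod: "#{model}_prod", staging: "#{model}_staging" }
end

# `aliases` maps alias name => collection name.
# Points `_prod` at the collection `_staging` currently targets, points
# `_staging` at `new_live_collection`, and reports which old collection
# (if any) is no longer referenced and can be dropped.
# Returns [updated_aliases, collection_to_drop]; the input hash is not mutated.
def promote(aliases, model, new_live_collection)
  names = alias_names(model)
  old_frozen = aliases[names[:prod]]
  current_live = aliases.fetch(names[:staging])

  updated = aliases.merge(
    names[:prod] => current_live,
    names[:staging] => new_live_collection
  )

  # Only drop the old frozen collection if no alias still points at it.
  droppable = old_frozen unless updated.value?(old_frozen)
  [updated, droppable]
end

# Usage with made-up collection names:
aliases = { 'places_prod' => 'places_01_2025', 'places_staging' => 'places_06_2025' }
updated, drop = promote(aliases, 'places', 'places_07_2025')
# `updated` now maps places_prod to places_06_2025 and places_staging to
# places_07_2025; `drop` is 'places_01_2025'.
```

Computing the new alias table first and deleting the old collection last means the `_prod` alias always points at a complete index.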

### Some further notes
- I think this would be a huge improvement over the current system, but I'm not confident enough in Rails to be certain exactly how big a lift it is. I have a basic scaffold in my head of how it should work, but I welcome comments/corrections/improvements.
- This outline does not include any solution for project-specific indexing scripts, such as indexing full text from an associated Faircopy project.
- It also does not include any solution for the problem that's come up a few times of wanting to index a relation-of-relation model, e.g. a places search with a facet for an identity label taxonomy related to people. I'm not sure whether that's something we want to add to the Rails scripts as an option somehow, or whether we want to continue to meet that need by running a project-specific node script that grabs the desired second-order data and adds it to the index (that's what I've done so far for both NBU and MiCA, but it's a bit ad hoc).
 
