Matching Engine

SanteMPI uses the SanteDB matching engine. This page may be moved in the future to a common page as the SanteDB matching engine supports more than just Patient resources.

The matching engine in SanteDB is a multi-stage process in which an inbound record (the target record) is compared with the current dataset within the CDR. The matching process occurs in three stages:

  • Blocking : In the blocking phase, records are queried from the CDR's database infrastructure. Blocking reduces the total number of records passed to the scoring stage, which is more computationally expensive.

  • Scoring : In the scoring stage, the target record is compared to the records retrieved from the blocking phase. Each configured attribute is transformed/normalized and then assigned a score. Furthermore, there are API hooks for implementing partial weighting/partial scoring of an attribute (based on the frequency of data in the database, NLP, or other future methods).

  • Classification : In the classification stage, each record's score (the sum of its attribute scores) is grouped into Match, NonMatch, or ProbableMatch according to configured thresholds.

SanteDB allows multiple match configurations to be defined, and one of them can be designated the "default" match configuration used during regular operation (whenever data is inserted, updated, etc.).
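The three stages above can be sketched as a simple pipeline. This is an illustrative sketch only; the function names and configuration shape are hypothetical, not the SanteDB API.

```python
# Minimal sketch of the block -> score -> classify pipeline described above.
# block_fn, score_fn, and classify_fn stand in for the configured blocking
# instructions, attribute scoring, and threshold classification.

def match_pipeline(target, records, block_fn, score_fn, classify_fn):
    """Run blocking, scoring, and classification of `records` against `target`."""
    # 1. Blocking: cheaply narrow the full dataset to candidate records
    candidates = [r for r in records if block_fn(target, r)]
    # 2. Scoring: compute a total score for each candidate
    scored = [(r, score_fn(target, r)) for r in candidates]
    # 3. Classification: bucket each total score against configured thresholds
    return [(r, total, classify_fn(total)) for r, total in scored]
```

Blocking runs first precisely because scoring every record in the CDR against the target would be prohibitively expensive.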

Blocking

In the blocking phase of the matching execution, the candidate record (named $input) is compared with records persisted in the database using one or more blocking configurations.

A blocking configuration is expressed using the HDSI Query Syntax which is translated into SQL. Blocking instructions can use any supported Filter Functions for the selected database.
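For illustration, a blocking instruction that filters on family name and date of birth might be expressed in HDSI query syntax roughly as follows. This is a sketch only; the exact grammar and the syntax for referencing $input values should be confirmed against the HDSI Query Syntax documentation.

```
name.component[Family].value=$input.name.component[Family].value&dateOfBirth=$input.dateOfBirth
```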

Records are read from the database (as blocks) with multiple blocks being combined with either an intersect or union function. Blocks can be loaded as either:

  • SOURCE Records: Whereby the source records are loaded from the database as they were sent. Blocking in this mode is less CPU intensive and less database intensive; however, it relies on source information as a "picture" of what data is available for a patient.

  • MASTER Records: Whereby the blocks are loaded using the MDM layer and are computed based on existing known and suspected links. This method of blocking more closely resembles what users see in the UI when MDM is enabled; however, it slows down matching performance because each record must be cross-referenced with the master data record. It also allows for matching based on records of truth.

Blocks from each statement are combined to form a result set (in C#, an IEnumerable&lt;T&gt;) which is passed into the scoring stage.
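The combination of block result sets can be sketched as set operations over record identifiers. This is an illustrative sketch, not the SanteDB implementation (which streams results rather than materializing sets).

```python
# Combine the result sets of several blocking instructions using either
# a union or an intersect function, as described above. Each block is
# represented here as a list of record ids.

def combine_blocks(block_results, operation="union"):
    """Merge several blocking result sets into one candidate set."""
    sets = [set(block) for block in block_results]
    if not sets:
        return set()
    combined = sets[0]
    for s in sets[1:]:
        # union widens the candidate pool; intersect narrows it
        combined = combined | s if operation == "union" else combined & s
    return combined
```

A union is appropriate when each blocking instruction is an independent "way in" to the candidate pool (as in the SSN-or-demographics example below); an intersect is appropriate when every instruction must agree.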

In the example below, the SanteDB matching engine will load records for an $input record by:

  • If the input record contains an SSN identifier, it will filter records in the database by matching SSN and then perform an MDM gather (i.e. the matching mode is performed on MASTER records). These records will be unioned with:

  • The results of a local query whereby:

    • If the $input has Given name, then the given name must match, AND

    • If the $input has a Family name, then the family name must match, AND

    • If the $input has a dateOfBirth then the date of birth must match, AND

    • If the $input has a gender concept then the gender must match

In pseudocode terms, the blocking query for an $input of John Smith, Male, SSN 203-204-3049 born 1980-03-02 would resemble:

SELECT * 
FROM 
    masters AS master
    LEFT JOIN locals AS local
WHERE
    local.identifier[SSN].value = '203-204-3049'
UNION ALL 
SELECT * 
FROM 
    locals AS local
WHERE 
    local.name.component[Given].value = 'John'
    AND local.name.component[Family].value = 'Smith'
    AND local.dateOfBirth = '1980-03-02' 
    AND local.gender = 'Male'

The actual SQL generated by the SanteDB iCDR is much more complex; this example only illustrates the concept.

Scoring

During the scoring phase, the records from the blocking stage are compared to the $input record on an attribute-by-attribute basis using a collection of assertions about each attribute. If all assertions on an attribute evaluate to TRUE, then the matchWeight score is added to that record's total score; if the assertions are FALSE, then the nonMatchWeight score (a negative score) is added to the record's total score.

Overall, the process of comparing a blocked record (named $block) with the $input record is:

  1. The scoring attribute may declare that it depends on another attribute being scored (e.g. don't evaluate the city attribute unless the state attribute has passed). If the dependent attribute was not scored as a positive (match), then the current attribute is assigned its whenNull score.

  2. The attribute path on the configuration is read from both $block and $input.

  3. The matching engine determines whether the attribute value extracted from either $block or $input is null. If either is null, then the configured When Null Action is taken.

  4. The matching engine then determines if any transforms have been configured (see: Transforming Data). This is a process whereby data is extracted, tokenized, shifted, padded, etc. on both the $input and $block variables. The result of each transform is stored as the new attribute value in memory and the next transform is applied against the output of the previous.

  5. Finally, the actual assertion is applied. The assertion is usually a binary logical operator (less than, equal, etc.), the result of which determines whether the matchWeight or nonMatchWeight is applied.
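The per-attribute steps above can be sketched as follows. The attribute-configuration fields (depends_on, transforms, assertion, match_weight, non_match_weight) mirror the concepts in the text but are hypothetical names, not the SanteDB configuration schema; the when-null handling is simplified to the "zero" behavior.

```python
# Illustrative per-attribute scoring following steps 1-5 above.
# `prior_results` maps already-scored attribute names to True/False.

def score_attribute(attr, input_val, block_val, prior_results):
    """Return (score_delta, matched) for one configured attribute."""
    # Step 1: skip if a dependent attribute did not score as a match
    if attr.get("depends_on") and not prior_results.get(attr["depends_on"]):
        return 0.0, False
    # Step 3: when-null handling (simplified here to the "zero" behavior)
    if input_val is None or block_val is None:
        return 0.0, False
    # Step 4: apply each configured transform, in order, to both values
    for transform in attr.get("transforms", []):
        input_val, block_val = transform(input_val), transform(block_val)
    # Step 5: apply the assertion and the corresponding weight
    if attr["assertion"](input_val, block_val):
        return attr["match_weight"], True
    return attr["non_match_weight"], False
```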

When Null Actions

Either the inbound record or a source record from the database may be missing the specified attribute. When this is the case, the attribute's whenNull setting controls how the evaluation of that attribute behaves. The behaviors are:

  • match : When the value is null in $block or $input, treat the attribute as a "match" (i.e. assume the missing data matches).

  • nonmatch : When the value is null in either $block or $input, treat the attribute as a non-match (i.e. assume the missing data would not match).

  • ignore : When the value is null in $block or $input, don't evaluate the attribute (i.e. neither the non-match nor the match score is considered). This is similar to zero; however, the entire attribute is removed from consideration: nothing is added to the absolute score, and the total possible score is reduced.

  • disqualify : When the value is null in $block or $input, disqualify the entire record from consideration (i.e. it doesn't matter what the other attribute scores are, the record is considered not a match).

  • zero : When the value is null in $block or $input, the attribute is scored 0. This differs from applying the match or non-match weight, which would be a positive or negative number respectively. This is similar to the ignore setting, except that zero does not impact the denominator of the score.
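The effect of each behavior on the running score and the total possible score (the denominator) can be sketched as follows. The function and return shape are illustrative, not the SanteDB API.

```python
# Model each @whenNull behavior as its effect on (score, denominator,
# disqualified) when an attribute value is null, per the descriptions above.

def apply_when_null(when_null, match_weight, non_match_weight):
    """Return (score_delta, denominator_delta, disqualified) for a null value."""
    if when_null == "match":
        return match_weight, match_weight, False      # assume the data matches
    if when_null == "nonmatch":
        return non_match_weight, match_weight, False  # assume it would not match
    if when_null == "ignore":
        return 0.0, 0.0, False        # removed from numerator and denominator
    if when_null == "zero":
        return 0.0, match_weight, False  # scores 0; denominator still grows
    if when_null == "disqualify":
        return 0.0, 0.0, True         # the whole record is not a match
    raise ValueError(f"unknown whenNull behavior: {when_null}")
```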

Transforming Data

Transforms are applied to the attribute values of both records before the assertion is evaluated; for example, a date_extract transform can extract the year from each record, after which the assertion of "eq" is applied. The following data transforms are available in SanteDB.

  • addresspart_extract (Entity Address) : Extracts a portion of an address for consideration.

  • date_difference (Date) : Extracts the difference (in weeks, months, years, etc.) between the A record and B record.

  • date_extract (Date) : Extracts a portion of the date from both records.

  • name_alias (Entity Name) : Considers any of the configured aliases for the name meeting a particular threshold of relevance (i.e. Will = Bill is stronger than Tess = Theresa).

  • abs (Number) : Returns the absolute value of a number.

  • dmetaphone (Text) : Returns the double metaphone code (with configured code length) from the input string.

  • length (Text) : Returns the length of the text string.

  • levenshtein (Text) : Returns the Levenshtein string difference between the A and B values.

  • sorensen_dice (Text) : Returns the Sorensen-Dice coefficient of the A and B text values.

  • jaro_winkler (Text) : Returns the Jaro-Winkler measure (with default threshold of 0.7) between the A and B values.

  • metaphone (Text) : Returns the metaphone phonetic code for the attribute.

  • similarity (Text) : Returns a percentage (based on Levenshtein difference) of string A to string B (i.e. 80% similar or 90% similar).

  • soundex (Text) : Extracts the soundex codification of the input values.

  • substr (Text) : Extracts a portion of the input values.

  • tokenize (Text) : Tokenizes the input string based on splitting characters; the tokenized values can then be transformed independently using any of the transforms listed here.

  • timespan_extract (TimeSpan) : Extracts a portion of a timespan such as minutes, hours, or seconds.
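To illustrate what two of these text transforms compute, here is a pure-Python sketch of levenshtein and the levenshtein-based similarity percentage. This is only a conceptual reference implementation; SanteDB's own implementations may differ in details.

```python
# Reference implementations of the levenshtein and similarity transforms
# listed above (illustrative only).

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Fraction of `a` and `b` in agreement, based on levenshtein difference."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest
```

For example, "smith" and "smyth" differ by one edit out of five characters, so they are 80% similar.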

Classification

The scores for each of the scored attributes are then summed for each $block record and the block is classified as:

  • Match: The two records are determined to agree with one another according to configuration. The matching engine instructs whatever called it (the MDM, MPI, etc.) that the two should be linked/merged/combined/etc.

  • Possible: The two records are not "non-matches" however there is insufficient confidence that the two are definite matches. Whatever called the matching operation should flag the two for manual reconciliation.

  • Non-Match: The two records are definite non-matches.
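The classification of a summed score can be sketched as two configured thresholds; the threshold values below are illustrative, not defaults.

```python
# Bucket a record's total score into the three classes described above.
# The threshold values are hypothetical examples, not SanteDB defaults.

def classify(total_score, match_threshold=7.5, non_match_threshold=2.0):
    """Classify a summed attribute score against configured thresholds."""
    if total_score >= match_threshold:
        return "Match"          # confident enough to link/merge automatically
    if total_score >= non_match_threshold:
        return "Possible"       # flag for manual reconciliation
    return "NonMatch"           # definite non-match
```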

Bulk / Batch Matching

Each data type which is registered in the Master Data Management pattern has a corresponding Match Job registered. This job resets the suspected ground truth using the following rules:

Remove / Delete:

  • Suspected Client Links (MDM-Candidate links)

  • Automatic Master Links

Keep:

  • Verified Ignore

  • Verified Matches

  • System Master Links

After the suspected truth is cleared, the job begins the process of re-matching the registered dataset for SanteDB. The matching process is multi-threaded, and is designed to use the machine on which the match process runs as efficiently as possible. To do this, the following pattern is used:

The batch matching process registers 4 primary threads on the actively configured thread pool to control the match process:

  • Gather Thread: This worker is responsible for querying data from the source dataset in 1,000 record batches. The rate at which the records are loaded will depend on the speed of the persistence layer (SanteDB 2.1.x or 2.3.x) as well as the disk performance of the storage technology.

  • Match Thread: This worker is responsible for breaking the 1,000 record batches into smaller partitions of records (depending on CPU of host machine). The configured matching algorithm is then launched for each record in the batch on independent threads (i.e. matching is done concurrently on the host machine).

  • Writer Thread: Once the match thread workers have completed their matching task, the results are queued for the writer thread. The writer thread is responsible for committing the matches to the read/write database.

  • Monitor Thread: The monitoring thread is responsible for updating the status of the job.
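The gather/match/write pattern above is a classic producer-consumer pipeline, which can be sketched with Python's standard library. The structure (queues, poison pills, worker count) is illustrative, not the SanteDB implementation, and the monitor thread is omitted for brevity.

```python
# Sketch of the batch-matching thread pattern: one gather thread feeds
# batches to concurrent match workers, whose results a single writer
# thread commits to the database.
import queue
import threading

def run_batch_match(load_batches, match_record, write_results, workers=2):
    """load_batches() yields record batches; match_record scores one record;
    write_results commits one list of match results."""
    to_match, to_write = queue.Queue(), queue.Queue()

    def gather():
        # Gather thread: query source records in batches
        for batch in load_batches():
            to_match.put(batch)
        for _ in range(workers):
            to_match.put(None)  # one poison pill per match worker

    def match():
        # Match threads: score each record in the batch concurrently
        while (batch := to_match.get()) is not None:
            to_write.put([match_record(record) for record in batch])

    def write():
        # Writer thread: commit completed matches to the read/write database
        while (results := to_write.get()) is not None:
            write_results(results)

    gatherer = threading.Thread(target=gather)
    matchers = [threading.Thread(target=match) for _ in range(workers)]
    writer = threading.Thread(target=write)
    gatherer.start()
    writer.start()
    for m in matchers:
        m.start()
    gatherer.join()
    for m in matchers:
        m.join()
    to_write.put(None)  # all matchers finished; stop the writer
    writer.join()
```

Separating the gather, match, and write roles lets the slow stages (database reads and writes) overlap with CPU-bound scoring, which is why the pattern scales with available worker threads.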

The performance of the batch matching will depend on the speed of the host machine as well as the version of SanteDB that is being used.

SanteSuite's community server was used for testing in the following configuration:

  • Application Server:

    • 4 VCPU

    • 4 GB RAM

    • Non-persistent (ram-only) REDIS Cache

  • Database Server:

    • 12 VCPU

    • 12 GB RAM

    • RAID 1+0 SSD disk infrastructure (4+4 SSD)

The versions of SanteDB tested yielded the following results:

  • Version < 2.1.160 of SanteDB: ~28,000 records per hour

  • Version > 2.1.165 of SanteDB: ~50,000 records per hour

  • Version 2.3.x of SanteDB (internal alpha): ~100,000 records per hour

It is important to ensure that your host system is configured such that the thread pool (accessed through the Probes administrative panel) has at minimum, 5 available worker threads to complete batch matching.
