NLPQL Expression Evaluation

Overview

In this section we describe the mechanisms that ClarityNLP uses to evaluate NLPQL expressions. NLPQL expressions are found in define statements such as:

define hasFever:
    where Temperature.value >= 100.4;

define hasSymptoms:
    where hasFever AND (hasDyspnea OR hasTachycardia);

The expressions in each statement consist of everything between the where keyword and the semicolon:

Temperature.value >= 100.4

hasFever AND (hasDyspnea OR hasTachycardia)

NLPQL expressions can either be mathematical or logical in nature, as these examples illustrate.

Recall that the processing stages for a ClarityNLP job proceed roughly as follows:

  1. Parse the NLPQL file and determine which NLP tasks to run.
  2. Formulate a Solr query to find relevant source documents, partition the source documents into batches, and assign batches to computational tasks.
  3. Run the tasks in parallel and write individual task results to MongoDB. Each individual result from an NLP task comprises a task result document in the Mongo database. The term document is used here in the MongoDB sense, meaning an object containing key-value pairs. The MongoDB ‘documents’ should not be confused with the Solr source documents, which are electronic health records.
  4. Evaluate NLPQL expressions using the task result documents as the source data. Write expression evaluation results to MongoDB as separate result documents.

Thus ClarityNLP evaluates expressions after all tasks have finished running and have written their individual results to MongoDB. The expression evaluator consumes the task results inside MongoDB and uses them to generate new results from the expression statements.

We now turn our attention to a description of how the expression evaluator works.

The expression evaluator is built upon the MongoDB aggregation framework. Why use MongoDB aggregation to evaluate NLPQL expressions? The basic reason is that ClarityNLP writes results from each run to a MongoDB collection, and it is more efficient to evaluate expressions using MongoDB facilities than to use something else. Use of a non-Mongo evaluator would require ClarityNLP to:

  • Run a set of queries to extract the data from MongoDB
  • Transmit the query results across a network (if the Mongo instance is hosted remotely)
  • Ingest the query results into another evaluation engine
  • Evaluate the NLPQL expressions and generate results
  • Transmit the results back to the Mongo host (if the Mongo instance is hosted remotely)
  • Insert the results into MongoDB.

Evaluation via the MongoDB aggregation framework is more efficient than this process, since all data resides inside MongoDB.

NLPQL Expression Types

In the descriptions below we refer to NLPQL variables, which have the form nlpql_feature.field_name. The NLPQL feature is a label introduced in a define statement. The field_name is the name of an output field generated by the task associated with the NLPQL feature.

The output field names from ClarityNLP tasks can be found in the NLPQL Reference.

1. Simple Mathematical Expressions

A simple mathematical expression is a string containing NLPQL variables, operators, parentheses, or numeric literals. Some examples:

Temperature.value >= 100.4
(Meas.dimension_X > 5) AND (Meas.dimension_X < 20)
(0 == Temperature.value % 20) OR (1 == Temperature.value % 20)

The variables in a simple mathematical expression all refer to a single NLPQL feature.

Simple mathematical expressions produce a result from data contained in a single task result document. The result of the expression evaluation is written to a new MongoDB result document.

2. Simple Logic Expressions

A simple logic expression is a string containing NLPQL features, parentheses, and the logic operators AND, OR, and NOT. For instance:

hasRigors OR hasDyspnea
hasFever AND (hasDyspnea OR hasTachycardia)
(hasShock OR hasDyspnea) AND (hasTachycardia OR hasNausea)
(hasFever AND hasNausea) NOT (hasRigors OR hasDyspnea)

Logic expressions operate on high-level NLPQL features, not on numeric literals or NLPQL variables. The presence of a numeric literal or NLPQL variable indicates that the expression is either a mathematical expression or possibly invalid.

Simple logic expressions produce a result from data contained in one or more task result documents. In other words, logic expressions operate on sets of result documents. The result from the logical expression evaluation is written to one or more new MongoDB result documents (the details will be explained below).

The NOT operator requires additional commentary. ClarityNLP supports the use of NOT as a synonym for “set difference”. Thus A NOT B means all elements of set A that are NOT also elements of set B. The use of NOT to mean “set complement” is not supported. Hence expressions such as NOT A, NOT hasRigors, etc., are invalid NLPQL statements. The NOT operator must appear between two other expressions.
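The set-difference semantics of NOT can be illustrated with ordinary Python sets. This is a conceptual sketch only; the evaluator itself operates on MongoDB result documents, and the names here are illustrative:

```python
# Conceptual sketch: A NOT B is the set difference A - B, i.e. all
# elements of A that are not also elements of B.
def nlpql_not(a, b):
    return a - b

has_fever = {'patient1', 'patient2', 'patient3'}
has_rigors = {'patient2'}

# patients with fever but without rigors
result = nlpql_not(has_fever, has_rigors)   # {'patient1', 'patient3'}
```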

3. Mixed Expressions

A mixed expression is a string containing either:

  • A mathematical expression and a logic expression
  • A mathematical expression using variables involving two or more NLPQL features

For instance:

// both math and logic
(Temperature.value >= 100.4) AND (hasDyspnea OR hasTachycardia)

// two NLPQL features: LesionMeasurement and Temperature
(LesionMeasurement.dimension_X >= 10) OR (Temperature.value >= 100.4)

// math, logic, and multiple NLPQL features
Temperature.value >= 100.4 AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)

The evaluation mechanisms used for mathematical, logic, and mixed expressions are quite different. To fully understand the issues involved, it is helpful to first understand the meaning of the ‘intermediate’ and ‘final’ phenotype results.

Phenotype Result CSV Files

Upon submission of a new job, ClarityNLP prints information to stdout that looks similar to this:

HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8
Content-Length: 1024
Access-Control-Allow-Origin: *
Server: Werkzeug/0.14.1 Python/3.6.4
Date: Fri, 23 Nov 2018 18:40:38 GMT
{
   "job_id": "11108",
   "phenotype_id": "11020",
   "phenotype_config": "http://localhost:5000/phenotype_id/11020",
   "pipeline_ids": [
        12529,
        12530,
        12531,
        12532,
        12533,
        12534,
        12535
    ],
    "pipeline_configs": [
        "http://localhost:5000/pipeline_id/12529",
        "http://localhost:5000/pipeline_id/12530",
        "http://localhost:5000/pipeline_id/12531",
        "http://localhost:5000/pipeline_id/12532",
        "http://localhost:5000/pipeline_id/12533",
        "http://localhost:5000/pipeline_id/12534",
        "http://localhost:5000/pipeline_id/12535"
    ],
    "status_endpoint": "http://localhost:5000/status/11108",
    "results_viewer": "?job=11108",
    "luigi_task_monitoring": "http://localhost:8082/static/visualiser/index.html#search__search=job=11108",
    "intermediate_results_csv": "http://localhost:5000/job_results/11108/phenotype_intermediate",
    "main_results_csv": "http://localhost:5000/job_results/11108/phenotype"
}

Here we see various items relevant to the job submission. Each submission receives a job_id, which is a unique numerical identifier for the run. ClarityNLP writes all task results from all jobs to the phenotype_results collection in a Mongo database named nlp. The job_id is needed to distinguish the data belonging to each run. Results can be extracted directly from the database by issuing MongoDB queries.

We also see URLs for ‘intermediate’ and ‘main’ phenotype results. These are convenience APIs that export the results to CSV files. The data in the intermediate result CSV file contains the output from each NLPQL task not marked as final. The main result CSV contains the results from any final tasks or final expression evaluations. The CSV file can be viewed in Excel or in another spreadsheet application.

Each NLP task generates a result document distinguished by a particular value of the nlpql_feature field. The define statement

define hasFever:
     where Temperature.value >= 100.4;

generates a set of rows in the intermediate CSV file with the nlpql_feature field set to hasFever. The NLP tasks

// nlpql_feature 'hasRigors'
define hasRigors:
    Clarity.ProviderAssertion({
        termset: [RigorsTerms],
        documentset: [ProviderNotes]
    });

// nlpql_feature 'hasDyspnea'
define hasDyspnea:
    Clarity.ProviderAssertion({
        termset: [DyspneaTerms],
        documentset: [ProviderNotes]
    });

generate two blocks of rows in the CSV file, the first block having the nlpql_feature field set to hasRigors and the next block having it set to hasDyspnea. The different nlpql_feature blocks appear in order as listed in the source NLPQL file. The presence of these nlpql_feature blocks makes locating the results of each NLP task a relatively simple matter.

Expression Evaluation Algorithms

ClarityNLP evaluates expressions via a multi-step procedure. In this section we describe the different processing stages.

Expression Tokenization and Parsing

The NLPQL front end parses the NLPQL file and sends the raw expression text to the evaluator (nlp/data_access/expr_eval.py). The evaluator module parses the expression text and converts it to a fully-parenthesized token string. The tokens are separated by whitespace and all operators are replaced by string mnemonics (such as GE for the operator >=, LT for the operator <, etc.).

If the expression includes any subexpressions involving only numeric literals, they are evaluated at this stage, and each literal subexpression is replaced with its computed value.
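This literal folding can be sketched as follows, assuming purely numeric parenthesized subexpressions. The regex and the fold_literals name are illustrative, not the actual parser code:

```python
import re

# Hypothetical sketch: repeatedly evaluate parenthesized subexpressions
# that contain only numbers and arithmetic operators, replacing each
# with its computed value.
_NUMERIC_PAREN = re.compile(r'\(\s*[\d.\s+\-*/%]+\)')

def fold_literals(expr):
    def _eval(match):
        # the match contains only digits and arithmetic operators, so
        # eval() cannot execute arbitrary code here
        return str(eval(match.group(0)))
    prev = None
    while prev != expr:
        prev = expr
        expr = _NUMERIC_PAREN.sub(_eval, expr)
    return expr

folded = fold_literals('Temperature.value >= (90 + 10.4)')
# folded == 'Temperature.value >= 100.4'
```

The loop repeats until nothing changes, so nested literal subexpressions such as ((2 + 3) * 4) fold from the inside out.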

Validity Checks

The evaluator then runs validity checks on each token. If it finds a token that it does not recognize, it tries to resolve it into a series of known NLPQL features separated by logic operators. For instance, if the evaluator were to encounter the token hasRigorsANDhasDyspnea under circumstances in which only hasRigors and hasDyspnea were valid NLPQL features, it would replace this single token with the string hasRigors AND hasDyspnea. If it cannot perform the separation (such as with the token hasRigorsA3NDhasDyspnea) it reports an error and writes error information into the log file.
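The resolution step can be sketched as a greedy scan over the token; split_token is a hypothetical name, and longer feature names are tried first to avoid partial matches:

```python
# Sketch of the token-resolution step. Try to decompose a run-together
# token into known NLPQL features and logic operators; return None if
# the token cannot be fully resolved.
def split_token(token, known_features):
    pieces = sorted(known_features, key=len, reverse=True) + ['AND', 'OR', 'NOT']
    out, i = [], 0
    while i < len(token):
        for piece in pieces:
            if token.startswith(piece, i):
                out.append(piece)
                i += len(piece)
                break
        else:
            return None   # unresolvable token, e.g. 'hasRigorsA3NDhasDyspnea'
    return ' '.join(out)

resolved = split_token('hasRigorsANDhasDyspnea', ['hasRigors', 'hasDyspnea'])
# resolved == 'hasRigors AND hasDyspnea'
```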

If the validity checks pass, the evaluator next determines the expression type. The valid types are EXPR_TYPE_MATH, EXPR_TYPE_LOGIC, and EXPR_TYPE_MIXED. If the expression type cannot be determined, the evaluator reports an error and writes error information into the log file.

Subexpression Substitution

If the expression is of mixed type, the evaluator locates all simple math subexpressions contained within and replaces them with temporary NLPQL feature names, thereby converting math subexpressions to logic subexpressions. The substitution process continues until all mathematical subexpressions have been replaced with substitute NLPQL features, at which point the expression type becomes EXPR_TYPE_LOGIC.

To illustrate the substitution process, consider one of the examples from above:

Temperature.value >= 100.4 AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)

This expression is of mixed type, since it contains the mathematical subexpression Temperature.value >= 100.4, the logic subexpression (hasRigors OR hasNausea), and the mathematical subexpression (LesionMeasurement.dimension_X >= 15). The NLPQL features in each math subexpression, Temperature and LesionMeasurement, also differ.

The evaluator identifies the Temperature subexpression and replaces it with a substitute NLPQL feature, m0 (for instance). This transforms the original expression into:

(m0) AND (hasRigors OR hasNausea) AND (LesionMeasurement.dimension_X >= 15)

Now only one mathematical subexpression remains.

The evaluator again makes a substitution m1 for the remaining mathematical subexpression, which converts the original into

(m0) AND (hasRigors OR hasNausea) AND (m1)

This is now a pure logic expression.

Thus the substitution process transforms the original mixed-type expression into three subexpressions, each of which is of simple math or simple logic type:

subexpression 1 (m0): 'Temperature.value >= 100.4'
subexpression 2 (m1): 'LesionMeasurement.dimension_X >= 15'
subexpression 3:      '(m0) AND (hasRigors OR hasNausea) AND (m1)'

By evaluating each subexpression in order, the result of evaluating the original mixed-type expression can be obtained.
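Given the identified math subexpressions, the substitution step itself is straightforward. This sketch assumes the subexpressions have already been located; substitute_math is a hypothetical name:

```python
# Sketch of the substitution step: replace each math subexpression with
# a temporary NLPQL feature name m0, m1, ..., and remember the mapping
# so the subexpressions can be evaluated first.
def substitute_math(expr, math_subexprs):
    subs = {}
    for i, sub in enumerate(math_subexprs):
        name = 'm{0}'.format(i)
        subs[name] = sub
        expr = expr.replace(sub, name)
    return expr, subs

expr = ('Temperature.value >= 100.4 AND (hasRigors OR hasNausea) '
        'AND (LesionMeasurement.dimension_X >= 15)')
new_expr, subs = substitute_math(
    expr,
    ['Temperature.value >= 100.4', 'LesionMeasurement.dimension_X >= 15'])
# new_expr == 'm0 AND (hasRigors OR hasNausea) AND (m1)'
```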

Evaluation of Mathematical Expressions

Removal of Unnecessary Parentheses

The evaluator next removes all unnecessary pairs of parentheses from the mathematical expression. A pair of parentheses is unnecessary if it can be removed without affecting the result. The evaluator detects changes in the result by converting the expression with a pair of parentheses removed to postfix, then comparing the postfix form with that of the original. If the postfix expressions match, that pair of parentheses was non-essential and can be discarded. The postfix form of the expression has no parentheses, as described below.

Conversion to Explicit Form

After removal of nonessential parentheses, the evaluator rewrites the expression so that its tokens match what is actually stored in the database. This involves adding an explicit comparison against the NLPQL feature and replacing each NLPQL variable with its bare field name. To illustrate, consider the hasFever example above:

define hasFever:
    where Temperature.value >= 100.4;

The expression portion of this define statement is Temperature.value >= 100.4. The evaluator rewrites this as:

(nlpql_feature == Temperature) AND (value >= 100.4)

In this form the tokens match the fields actually stored in the task result documents in MongoDB.

Conversion to Postfix

Direct evaluation of an infix expression is complicated by parenthesization and operator precedence issues. The evaluation process can be greatly simplified by first converting the infix expression to postfix form. Postfix expressions require no parentheses, and a simple stack-based evaluator can be used to evaluate them directly.

Accordingly, a conversion to postfix form takes place next. This conversion process requires an operator precedence table. The NLPQL operator precedence levels match those of Python and are listed here for reference. Lower numbers imply lower precedence, so or has a lower precedence than and, which has a lower precedence than +, etc.

Operator    Precedence Value
(           0
)           0
or          1
and         2
not         3
<           4
<=          4
>           4
>=          4
!=          4
==          4
+           9
-           9
*           10
/           10
%           10
^           12

Conversion from infix to postfix is unambiguous if operator precedence and associativity are known. Operator precedence is given by the table above. All NLPQL operators are left-associative except for exponentiation, which is right-associative. The infix-to-postfix conversion algorithm is the standard one and can be found in the function _infix_to_postfix in the file nlp/data_access/expr_eval.py.
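The conversion can be sketched with the classic shunting-yard pass using the precedence table above. This is a simplified illustration of the algorithm implemented in _infix_to_postfix, not the actual code:

```python
# Minimal shunting-yard sketch. PRECEDENCE mirrors the table above;
# all operators are left-associative except '^'.
PRECEDENCE = {
    'or': 1, 'and': 2, 'not': 3,
    '<': 4, '<=': 4, '>': 4, '>=': 4, '!=': 4, '==': 4,
    '+': 9, '-': 9, '*': 10, '/': 10, '%': 10, '^': 12,
}
RIGHT_ASSOC = {'^'}

def infix_to_postfix(tokens):
    out, stack = [], []
    for tok in tokens:
        if tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack and stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()   # discard the '('
        elif tok in PRECEDENCE:
            # pop operators of higher precedence, or equal precedence
            # when the incoming operator is left-associative
            while (stack and stack[-1] != '(' and
                   (PRECEDENCE[stack[-1]] > PRECEDENCE[tok] or
                    (PRECEDENCE[stack[-1]] == PRECEDENCE[tok] and
                     tok not in RIGHT_ASSOC))):
                out.append(stack.pop())
            stack.append(tok)
        else:
            out.append(tok)   # operand
    while stack:
        out.append(stack.pop())
    return out
```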

After conversion to postfix, the hasFever expression becomes:

'nlpql_feature', 'Temperature', '==', 'value', '100.4', '>=', 'and'

Generation of the Aggregation Pipeline

The next task for the evaluator is to convert the expression into a sequence of MongoDB aggregation pipeline stages. This process involves the generation of an initial $match query to filter out everything but the data for the current job. The match query also checks for the existence of all entries in the field list and that they have non-null values. A simple existence check is not sufficient, since a null field actually exists but has a value that cannot be used for computation. Hence checks for existence and a non-null value are both necessary.

For the hasFever example, the initial match query generates a pipeline filter stage that looks like this, assuming a job_id of 12345:

{
    "$match": {
        "job_id": 12345,
        "nlpql_feature": {"$exists":True, "$ne":None},
        "value"        : {"$exists":True, "$ne":None}
    }
}

This match pipeline stage runs first and performs coarse filtering on the data in the result database. It finds only those task result documents matching the specified job_id, and it further restricts consideration to those documents having valid entries for the expression’s fields.

Subsequent Pipeline Stages

After generation of the initial match filter stage, the postfix expression is then ‘evaluated’ by a stack-based mechanism. The result of the evaluation process is not the actual expression value, but instead a set of MongoDB aggregation commands that tell MongoDB how to compute the result. The evaluation process essentially generates Python dictionaries that obey the aggregation syntax rules. More information can be found in the MongoDB aggregation pipeline documentation.

The pipeline actually does a $project operation and creates a new document with a Boolean field called value. This field has a value of True or False according to whether the source document satisfied the mathematical expression. The _id field of the projected document matches that of the original, so that a simple query on these _id fields can be used to recover the desired documents.
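The stack-based generation step can be sketched as follows. MONGO_OP, FIELDS, and postfix_to_mongo are illustrative names, and the sketch hard-codes which tokens are field references, a decision the real evaluator makes from the parsed expression:

```python
# Sketch: 'evaluate' the postfix tokens with a stack, but emit MongoDB
# aggregation operator dictionaries instead of values.
MONGO_OP = {'==': '$eq', '!=': '$ne', '>=': '$gte', '<=': '$lte',
            '>': '$gt', '<': '$lt', 'and': '$and', 'or': '$or'}

# tokens assumed to be document field names (illustrative)
FIELDS = {'nlpql_feature', 'value', 'subject', 'report_id'}

def postfix_to_mongo(postfix):
    stack = []
    for tok in postfix:
        if tok in MONGO_OP:
            right = stack.pop()
            left = stack.pop()
            stack.append({MONGO_OP[tok]: [left, right]})
        elif tok in FIELDS:
            stack.append('$' + tok)       # field reference
        else:
            try:
                stack.append(float(tok))  # numeric literal
            except ValueError:
                stack.append(tok)         # string literal, e.g. a feature name
    return {'$project': {'value': stack[0]}}
```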

The final aggregation pipeline for our example becomes:

// (nlpql_feature == Temperature) and (value >= 100.4)
{
    "$match": {
        "job_id": 12345,
        "nlpql_feature": {"$exists":True, "$ne":None},
        "value"        : {"$exists":True, "$ne":None}
    }
},
{
    "$project" : {
        "value" : {
            "$and" : [
                {"$eq"  : ["$nlpql_feature", "Temperature"]},
                {"$gte" : ["$value", 100.4]}
            ]
        }
    }
}

The completed aggregation pipeline gets sent to MongoDB for evaluation. Mongo performs the initial filtering operation, applies the subsequent pipeline stages to all surviving documents, and sets the “value” Boolean result. A final query extracts the matching documents and writes new result documents with an nlpql_feature field equal to the label from the define statement, which for this example would be hasFever.

Evaluation of Logic Expressions

The initial stages of the evaluation process for logic expressions proceed similarly to those for mathematical expressions. Unnecessary parentheses are removed and the expression is converted to postfix.

Detection of n-ary AND and OR

After the postfix conversion, a pattern matcher looks for instances of n-ary AND and/or OR in the set of postfix tokens. An n-ary OR would look like this, for n == 4:

// infix
hasRigors OR hasDyspnea OR hasTachycardia OR hasNausea

// postfix
hasRigors hasDyspnea OR hasTachycardia OR hasNausea OR

The n-value refers to the number of operands. All such n-ary instances are replaced with a variant form of the operator that includes the count. The reason for this is that n-ary AND and OR can be handled easily by the aggregation pipeline, and their use simplifies the pipeline construction process. For this example, the rewritten postfix form would become:

hasRigors hasDyspnea hasTachycardia hasNausea OR4
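The collapse can be sketched with a small stack pass over the postfix tokens. This is a simplified illustration: the names and the counted-operator spelling are assumptions, and here even binary operators receive a count:

```python
# Each stack entry is (operand_tokens, operator, operand_count);
# _flat() renders an entry back into postfix tokens.
def _flat(entry):
    toks, op, n = entry
    return toks + ([op + str(n)] if op else [])

def collapse_nary(postfix):
    stack = []
    for tok in postfix:
        if tok in ('AND', 'OR'):
            right = stack.pop()
            left = stack.pop()
            # a child built from the same operator contributes its
            # operand list; any other child is one opaque operand
            l_part = left[0] if left[1] == tok else _flat(left)
            r_part = right[0] if right[1] == tok else _flat(right)
            l_n = left[2] if left[1] == tok else 1
            r_n = right[2] if right[1] == tok else 1
            stack.append((l_part + r_part, tok, l_n + r_n))
        else:
            stack.append(([tok], None, 0))
    return _flat(stack.pop())
```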

Generation of the Aggregation Pipeline

As with mathematical expressions, the logic expression aggregation pipeline begins with an initial stage that filters on the job_id and checks that the nlpql_feature field exists and is non-null. No explicit field checks are needed since logic expressions do not use NLPQL variables. For a job_id of 12345, this initial filter stage is:

{
    "$match": {
        "job_id": 12345,
        "nlpql_feature": {"$exists":True, "$ne":None}
    }
}

Following this is another filter stage that removes all docs not having the desired NLPQL features. For the original logic expression example above:

hasFever AND (hasDyspnea OR hasTachycardia)

this second filter stage would look like this:

{
    "$match": {
        "nlpql_feature": {"$in": ['hasFever', 'hasDyspnea', 'hasTachycardia']}
    }
}

Grouping by Value of the Context Variable

The next stage in the logic pipeline is to group documents by the value of the context field. Recall that NLPQL files specify a context of either ‘document’ or ‘patient’, meaning that a document-centric or patient-centric view of the results is desired. In a document context, ClarityNLP needs to examine all data pertaining to a given document. In a patient context, it needs to examine all data pertaining to a given patient.

The grouping operation collects all such data (the ClarityNLP task result documents) that pertain to a given document or a given patient. Documents are distinguished by their report_id field, and patients are distinguished by their patient IDs, which are stored in the subject field. You can think of these groups as being the ‘evidence’ for a given document or for a given patient. If the patient has the conditions expressed in the NLPQL file, the evidence for it will reside in the group for that patient.

As part of the grouping operation ClarityNLP also generates a set of NLPQL features for each group. This set is called the feature_set and it will be used to evaluate the expression logic for the group as a whole.

The grouping pipeline stage looks like this:

{
    "$group": {
        "_id": "${0}".format(context_field),

        # save only these four fields from each doc; more efficient
        # than saving entire doc, uses less memory
        "ntuple": {
            "$push": {
                "_id": "$_id",
                "nlpql_feature": "$nlpql_feature",
                "subject": "$subject",
                "report_id": "$report_id"
            }
        },
        "feature_set": {"$addToSet": "$nlpql_feature"}
    }
}

Here we see the $group operator grouping the documents on the value of the context field. An ntuple array is generated for each different value of the context variable. This is the ‘evidence’ as discussed above. Only the essential fields for each document are used, which reduces memory consumption and improves efficiency. We also see the generation of the feature set for each group, in which each NLPQL feature for the group’s documents is added to the set.

At the conclusion of this pipeline stage, each group has two fields: an ntuple array that contains the relevant data for each document in the group, and a feature_set field that contains the distinct features for the group.

Logic Operation Stage

After the grouping operation, the logic operations of the expression are applied to the elements of the feature set. If a particular patient satisfies the hasFever condition, then at least one document in that patient’s group will have an NLPQL feature field with the value of hasFever. Since all the distinct values of the NLPQL features for the group are stored in the feature set, the feature set must also have an element equal to hasFever.

A check for set membership using aggregation syntax is expressed as:

{"$in": ["hasFever", "$feature_set"]}

This construct means to use the $in operator to test whether feature_set contains the element hasFever. The $in operator returns a Boolean result.

A successful test for feature set membership means that the patient has the stated feature.

The evaluator implements the expression logic by translating it into a series of set membership tests. For our example above, the logic operation pipeline stage becomes:

{
    '$match': {
        '$expr': {
            '$and': [
                {'$in': ['hasFever', '$feature_set']},
                {
                    '$or': [
                        {'$in': ['hasDyspnea', '$feature_set']},
                        {'$in': ['hasTachycardia', '$feature_set']}
                    ]
                }
            ]
        }
    }
}

Once again we have a match operation to filter the documents. Only those documents satisfying the expression logic will survive the filter. The $expr operator allows the use of aggregation syntax in contexts where the standard MongoDB query syntax would be required.

Following that we see a series of logic operations for our expression hasFever AND (hasDyspnea OR hasTachycardia). The inner $or operation tests the feature set for membership of hasDyspnea and hasTachycardia. If either or both are present, the $or operator returns True. The result of the $or is then used in an $and operation which tests the feature set for the presence of hasFever. If it is also present, the $and operator returns True as well, and the document in question survives the filter operation.

To summarize the evaluation process so far: ClarityNLP converts infix logic expressions to postfix form and groups the documents by value of the context variable. It uses a stack-based postfix evaluation mechanism to generate the aggregation statements for the expression logic. Each logic operation is converted to a test for the presence of an NLPQL feature in the feature set.

Final Aggregation Pipeline

With these operations the pipeline is complete. The full pipeline for our example is:

// aggregation pipeline for hasFever AND (hasDyspnea OR hasTachycardia)

// filter documents on job_id and check validity of the nlpql_feature field
{
    "$match": {
        "job_id": 12345,
        "nlpql_feature": {"$exists":True, "$ne":None}
    }
},

// filter docs on the desired NLPQL feature values
{
    "$match": {
        "nlpql_feature": {"$in": ['hasFever', 'hasDyspnea', 'hasTachycardia']}
    }
},

// group docs by value of context variable and create feature set
{
    "$group": {
        "_id": "${0}".format(context_field),
        "ntuple": {
            "$push": {
                "_id": "$_id",
                "nlpql_feature": "$nlpql_feature",
                "subject": "$subject",
                "report_id": "$report_id"
            }
        },
        "feature_set": {"$addToSet": "$nlpql_feature"}
    }
},

// perform expression logic on the feature set
{
    '$match': {
        '$expr': {
            '$and': [
                {'$in': ['hasFever', '$feature_set']},
                {
                    '$or': [
                        {'$in': ['hasDyspnea', '$feature_set']},
                        {'$in': ['hasTachycardia', '$feature_set']}
                    ]
                }
            ]
        }
    }
}

Result Generation

After constructing a math or logic aggregation pipeline, the evaluator runs the pipeline and receives the results from MongoDB. The result set is a list of document ObjectId values (_id) for a math expression, or a list of ObjectId values with grouping information for a logic expression. For math expressions, the documents whose _id values appear in the list are queried and written out as the result set. These documents have their nlpql_feature field set to the label of the define statement that contained the expression.

For logic expressions the process is more complex. To help explain what the evaluator does we present here a representation of the grouped documents after running the pipeline above, for the expression hasFever AND (hasDyspnea OR hasTachycardia):

ObjectId (_id)            nlpql_feature   subject  report_id
5c2e9e3431ab5b05db3430e1  hasDyspnea      19054    798209
5c2e9e3431ab5b05db3430e2  hasDyspnea      19054    798209
5c2e9e3431ab5b05db3430e3  hasDyspnea      19054    798209
5c2e9e3431ab5b05db3430e4  hasDyspnea      19054    798209
5c2e9ec931ab5b05db343efa  hasDyspnea      19054    1303796
5c2ea2bd31ab5b05db34868c  hasTachycardia  19054    1699977
5c2ea2bd31ab5b05db34868d  hasTachycardia  19054    1699977
5c2ea35a31ab5b05db348f19  hasTachycardia  19054    1802359
5c2ea3a531ab5b05db3492f6  hasTachycardia  19054    1905337
5c2ea42431ab5b05db34998c  hasTachycardia  19054    1802375
5c2ea42431ab5b05db34998d  hasTachycardia  19054    1802375
5c2eb55831ab5b05db35097b  hasFever        19054    ['1264178']
5c2eb55831ab5b05db350d45  hasFever        19054    ['1699944']
5c2eb55831ab5b05db350d46  hasFever        19054    ['1699944']

Here we see a representation of the document group for patient 19054. This group of documents can be considered to be the “evidence” for this patient. In the ObjectID column are the MongoDB ObjectID values for each task result document or mathematical result document. The nlpql_feature column shows which NLPQL feature ClarityNLP found for that document. The subject column shows that all documents in the group belong to patient 19054, and the report_id column shows the document identifier.

We see that patient 19054 has five instances of hasDyspnea, six instances of hasTachycardia, and three instances of hasFever. You can consider this group as being composed of three subgroups with five, six, and three elements each.

ClarityNLP presents result documents in a “flattened” format. For each NLPQL label introduced in a “define” statement, ClarityNLP generates a set of result documents containing that label in the nlpql_feature field. Each result document also contains a record of the source documents that were used as evidence for that label.

Flattening of the Result Group

To flatten these results and generate a set of output documents labeled by the hasSymptoms NLPQL feature (from the original “define” statement), ClarityNLP essentially has two options:

  • generate all possible ways to derive hasSymptoms from this data
  • generate the minimum number of ways to derive hasSymptoms from this data (while not ignoring any data)

The maximal result set can be generated by the following reasoning. First, in how many ways can patient 19054 satisfy the condition hasDyspnea OR hasTachycardia? From the data in the table, there are five ways to satisfy the hasDyspnea condition and six ways to satisfy the hasTachycardia condition, for a total of 5 + 6 = 11 ways. Then, for each of these ways, there are three ways for the patient to satisfy the condition hasFever. Thus there are a total of 3 * (5 + 6) = 3 * 11 = 33 ways for this patient to satisfy the condition hasFever AND (hasDyspnea OR hasTachycardia), which would result in the generation of 33 output documents under a maximal representation.

The minimal result set can be generated by the following reasoning. We have seen that there are 11 ways for this patient to satisfy the condition hasDyspnea OR hasTachycardia. Each of these must be paired with a hasFever, from the logical AND operator in the expression. By repeating each of the hasFever entries, we can “tile” the output and pair a hasFever with one of the 11 others. This procedure generates a result set containing only 11 entries instead of 33. It uses all of the output data, and it minimizes data redundancy.

In general, the cardinalities of the sets of NLPQL features connected by logical OR are added together to compute the number of possible results. For features connected by logical AND, the cardinalities are multiplied to get the total number of possibilities under a maximal representation (this is the Cartesian product). Under a minimal representation, the cardinality of the result is equal to the maximum cardinality of the constituent subsets.
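These counting rules can be expressed compactly. The sketch below assumes the expression has been factored into AND-connected groups of OR-connected features, with each inner list holding the cardinalities of one OR group; the function names are illustrative:

```python
# Cardinality arithmetic for AND/OR combinations.
def maximal_cardinality(and_groups):
    # OR adds cardinalities; AND multiplies them (Cartesian product)
    total = 1
    for group in and_groups:
        total *= sum(group)
    return total

def minimal_cardinality(and_groups):
    # OR still adds, but AND takes the largest operand and tiles the rest
    return max(sum(group) for group in and_groups)

# hasFever AND (hasDyspnea OR hasTachycardia), patient 19054:
# 3 fever docs AND (5 dyspnea OR 6 tachycardia docs)
maximal = maximal_cardinality([[3], [5, 6]])   # 3 * (5 + 6) = 33
minimal = minimal_cardinality([[3], [5, 6]])   # max(3, 5 + 6) = 11
```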

So which output representation does ClarityNLP use?

ClarityNLP uses the minimal representation of the output data.

Here is what the result set looks like using a minimal representation. Each of the 11 elements contains a pair of documents, one with the feature hasFever and the other having either hasDyspnea or hasTachycardia, as required by the expression. We show only the last four hex digits of the ObjectID for clarity:

// expression: hasFever AND (hasDyspnea OR hasTachycardia)

('097b', 'hasFever'), ('30e1', 'hasDyspnea')
('0d45', 'hasFever'), ('30e2', 'hasDyspnea')
('0d46', 'hasFever'), ('30e3', 'hasDyspnea')
('097b', 'hasFever'), ('30e4', 'hasDyspnea')
('0d45', 'hasFever'), ('3efa', 'hasDyspnea')
('0d46', 'hasFever'), ('868c', 'hasTachycardia')
('097b', 'hasFever'), ('868d', 'hasTachycardia')
('0d45', 'hasFever'), ('8f19', 'hasTachycardia')
('0d46', 'hasFever'), ('92f6', 'hasTachycardia')
('097b', 'hasFever'), ('998c', 'hasTachycardia')
('0d45', 'hasFever'), ('998d', 'hasTachycardia')

Note that the three hasFever entries repeat three times, followed by another repeat of the first two entries to make a total of 11. Each of these is paired with one of the five hasDyspnea entries or one of the six hasTachycardia entries. No data for this patient has been lost, and the result is 11 documents in a flattened format satisfying the logic of the original expression.
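The tiling itself amounts to zipping the larger AND operand against a cyclic repetition of the smaller one. A sketch using itertools follows; tile_pair is a hypothetical name, and the ObjectId suffixes are taken from the table above:

```python
from itertools import cycle

# Minimal sketch of the tiling step: the smaller AND operand repeats
# cyclically so that every element of the larger operand appears once.
def tile_pair(left, right):
    if len(left) >= len(right):
        return list(zip(left, cycle(right)))
    return list(zip(cycle(left), right))

# ObjectId suffixes from the patient-19054 example above
fever = ['097b', '0d45', '0d46']
others = ['30e1', '30e2', '30e3', '30e4', '3efa',
          '868c', '868d', '8f19', '92f6', '998c', '998d']

pairs = tile_pair(fever, others)   # 11 (hasFever, other) pairs
```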

Testing the Expression Evaluator

There is a comprehensive test program for the expression evaluator in the file nlp/data_access/expr_tester.py. The test program requires a running instance of MongoDB. We strongly recommend running Mongo on the same machine as the test program to minimize data transfer delays.

The test program loads a data file into MongoDB and evaluates a suite of expressions using the data. The expression logic is separately evaluated with Python set operations. The results from the two evaluations are compared and the tests pass only if both evaluations produce identical sets of patients.
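The cross-check idea can be sketched as follows: the same expression logic is evaluated independently over Python sets of patient identifiers, and the result is compared with the set of patients produced by the aggregation pipeline. All names and sample IDs here are illustrative:

```python
# Illustrative cross-check for hasFever AND (hasDyspnea OR hasTachycardia).
def expected_patients(fever, dyspnea, tachy):
    # same logic expressed with Python set algebra
    return fever & (dyspnea | tachy)

pipeline_patients = {19054}        # stand-in for the Mongo pipeline result
fever = {19054, 20001}
dyspnea = {19054}
tachy = {20002}

# the test passes only if both evaluations agree
assert expected_patients(fever, dyspnea, tachy) == pipeline_patients
```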

The test program can be run from the command line. For usage info, run with the --help option:

python3 ./expr_tester.py --help

The test program assumes that the user has permission to create a database without authentication.

To run the test suite with the default options, first launch MongoDB on your local system. Information about how to do that can be found in our native setup guide.

After MongoDB initializes, run the test program with this command, assuming the default Mongo port of 27017:

python3 ./expr_tester.py

If your MongoDB instance is hosted elsewhere or uses a non-default port number, provide the connection parameters explicitly:

python3 ./expr_tester.py --mongohost <ip_address> --mongoport <port_number>

The test program takes several minutes to run. Upon completion it should report that all tests passed.