Solr Setup and Configuration
Data types
We use standard Solr data types plus one custom data type, searchText. searchText is a text field that is tokenized on whitespace and filtered to support case-insensitive search.
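As a rough illustration of what this means for matching, the sketch below approximates the searchText analysis in plain Python: split on whitespace, then lowercase. It is a deliberate simplification, not Solr's analyzer, and the function name approx_search_tokens is hypothetical.

```python
def approx_search_tokens(text: str) -> list[str]:
    """Approximate searchText analysis: whitespace tokenization + lowercasing.

    Simplification: Solr's real analyzer chain applies additional filters
    that this sketch does not model.
    """
    return [token.lower() for token in text.split()]


# Differently cased inputs yield the same tokens, so they would match each other.
print(approx_search_tokens("Chest  Pain") == approx_search_tokens("chest pain"))
```
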
Fields
All documents in ClarityNLP are stored in Solr. These are the minimal required fields:
{
"report_type":"Report Type",
"id":"1",
"report_id":"1",
"source":"My Institution",
"report_date":"1970-01-01T00:00:00Z",
"subject":"the_patient_id_or_other_identifier",
"report_text":"Report text here"
}
id and report_id should each be unique within the data set, but the two may be equal to each other. report_text should be plain text. subject is generally the patient identifier, but could also be some other identifier, such as drug_name. source is generally your institution or the name of the document set.
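Before loading documents, it can help to check that each one carries the minimal fields. The following is a minimal sketch in Python, assuming documents are built as dictionaries; the helper names solr_date and missing_fields are hypothetical and not part of ClarityNLP.

```python
from datetime import datetime, timezone

# The minimal required fields, as listed in the example document.
REQUIRED_FIELDS = (
    "report_type", "id", "report_id", "source",
    "report_date", "subject", "report_text",
)


def solr_date(dt: datetime) -> str:
    """Format a datetime as the UTC string used in the example document.

    Note: a naive datetime is interpreted as local time by astimezone().
    """
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def missing_fields(doc: dict) -> list[str]:
    """Return the required field names that are absent or empty in doc."""
    return [name for name in REQUIRED_FIELDS if not doc.get(name)]


doc = {
    "report_type": "Report Type",
    "id": "1",
    "report_id": "1",
    "source": "My Institution",
    "report_date": solr_date(datetime(1970, 1, 1, tzinfo=timezone.utc)),
    "subject": "the_patient_id_or_other_identifier",
    "report_text": "Report text here",
}
print(missing_fields(doc))
```

A document with an empty result from missing_fields has every minimal field populated; uniqueness of id and report_id across the data set still has to be checked separately.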
Additional fields can be added to store additional metadata. The following dynamic field patterns are allowed:
- *_section (searchText), e.g. past_medical_history_section (for indexing specific sections of notes)
- *_id (long), e.g. doctor_id (any other id you wish to store)
- *_ids (long, multiValued), e.g. medication_ids (any other id as an array)
- *_system (string), e.g. code_system (noting any system values)
- *_attr (string), e.g. clinic_name_attr (any single-valued custom attribute)
- *_attrs (string, multiValued), e.g. insurer_names_attrs (any multi-valued custom attribute)
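The suffix patterns above can be sketched as a small lookup that tells you which type a custom field name would receive. This is an illustrative helper only, assuming simple suffix matching; the names DYNAMIC_FIELD_SUFFIXES and dynamic_field_type are hypothetical.

```python
# Suffix -> (Solr type, multiValued), taken from the list of dynamic fields.
DYNAMIC_FIELD_SUFFIXES = {
    "_section": ("searchText", False),
    "_id": ("long", False),
    "_ids": ("long", True),
    "_system": ("string", False),
    "_attr": ("string", False),
    "_attrs": ("string", True),
}


def dynamic_field_type(field_name: str):
    """Return (type, multi_valued) for a field name, or None if no pattern matches."""
    for suffix, spec in DYNAMIC_FIELD_SUFFIXES.items():
        # A pattern "*_x" matches any name that ends with "_x".
        if field_name.endswith(suffix):
            return spec
    return None


print(dynamic_field_type("medication_ids"))
print(dynamic_field_type("insurer_names"))
```

A name that returns None, such as insurer_names here, matches none of the patterns and would need a matching suffix or its own field definition.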
Custom Solr Setup
This should already be done for you if you are using Docker. Otherwise, use the following commands to set up Solr.
- Install Solr
- Set up the custom tokenized field type:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type" : {
    "name":"searchText",
    "class":"solr.TextField",
    "positionIncrementGap":"100",
    "analyzer" : {
      "charFilters":[{
        "class":"solr.PatternReplaceCharFilterFactory",
        "replacement":"$1$1",
        "pattern":"([a-zA-Z])\\\\1+" }],
      "tokenizer":{
        "class":"solr.WhitespaceTokenizerFactory" },
      "filters":[{
        "class":"solr.WordDelimiterFilterFactory",
        "preserveOriginal":"0" },
        {"class": "solr.LowerCaseFilterFactory"
      }]}}
}' http://localhost:8983/solr/report_core/schema
- Add standard fields (Solr 6):
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{"name":"report_date","type":"date","indexed":true,"stored":true,"default":"NOW"},
"add-field":{"name":"report_id","type":"string","indexed":true,"stored":true},
"add-field":{"name":"report_text","type":"searchText","indexed":true,"stored":true,"termPositions":true,"termVectors":true,"docValues":false,"required":true},
"add-field":{"name":"source","type":"string","indexed":true,"stored":true},
"add-field":{"name":"subject","type":"string","indexed":true,"stored":true},
"add-field":{"name":"report_type","type":"string","indexed":true,"stored":true}
}' http://localhost:8983/solr/report_core/schema
- Add standard fields (Solr 7 and later):
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{"name":"report_date","type":"pdate","indexed":true,"stored":true,"default":"NOW"},
"add-field":{"name":"report_id","type":"string","indexed":true,"stored":true},
"add-field":{"name":"report_text","type":"searchText","indexed":true,"stored":true,"termPositions":true,"termVectors":true,"docValues":false,"required":true},
"add-field":{"name":"source","type":"string","indexed":true,"stored":true},
"add-field":{"name":"subject","type":"string","indexed":true,"stored":true},
"add-field":{"name":"report_type","type":"string","indexed":true,"stored":true}
}' http://localhost:8983/solr/report_core/schema
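The two standard-field commands differ only in the type of report_date. A minimal Python sketch of the same request, parameterized by Solr major version, might look like this. It assumes the Schema API accepts an array of definitions under a single add-field key (worth verifying against your Solr version); the function names are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

SCHEMA_URL = "http://localhost:8983/solr/report_core/schema"


def standard_fields(solr_major_version: int) -> list[dict]:
    """Return the standard field definitions for the given Solr major version."""
    # Solr 6 uses "date"; Solr 7 and later use "pdate".
    date_type = "date" if solr_major_version <= 6 else "pdate"
    string_field = {"type": "string", "indexed": True, "stored": True}
    return [
        {"name": "report_date", "type": date_type, "indexed": True,
         "stored": True, "default": "NOW"},
        {"name": "report_id", **string_field},
        {"name": "report_text", "type": "searchText", "indexed": True,
         "stored": True, "termPositions": True, "termVectors": True,
         "docValues": False, "required": True},
        {"name": "source", **string_field},
        {"name": "subject", **string_field},
        {"name": "report_type", **string_field},
    ]


def schema_request(solr_major_version: int) -> urllib.request.Request:
    """Build (but do not send) the POST request that adds the standard fields."""
    # Assumption: the array form of "add-field" replaces the repeated JSON keys.
    body = json.dumps({"add-field": standard_fields(solr_major_version)})
    return urllib.request.Request(
        SCHEMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )
```

Passing the result of schema_request(7) to urllib.request.urlopen would send it to a running Solr instance.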
- Add dynamic fields (Solr 6):
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{"name":"*_section","type":"searchText","indexed":true,"stored":false},
"add-dynamic-field":{"name":"*_id","type":"long","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_ids","type":"long","multiValued":true,"indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_system","type":"string","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_attr","type":"string","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_attrs","type":"string","multiValued":true,"indexed":true,"stored":true}
}' http://localhost:8983/solr/report_core/schema
- Add dynamic fields (Solr 7 and later):
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-dynamic-field":{"name":"*_section","type":"searchText","indexed":true,"stored":false},
"add-dynamic-field":{"name":"*_id","type":"plong","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_ids","type":"plongs","multiValued":true,"indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_system","type":"string","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_attr","type":"string","indexed":true,"stored":true},
"add-dynamic-field":{"name":"*_attrs","type":"strings","multiValued":true,"indexed":true,"stored":true}
}' http://localhost:8983/solr/report_core/schema
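The difference between the two dynamic-field commands is a renaming of types. The sketch below restates that mapping in Python so the two variants can be compared side by side; solr7_type and SOLR6_DYNAMIC_FIELDS are hypothetical names, and the mapping covers only the types used in the commands above.

```python
def solr7_type(solr6_type: str, multi_valued: bool) -> str:
    """Map a Solr 6 dynamic-field type to the name used for Solr 7 and later."""
    if solr6_type == "long":
        return "plongs" if multi_valued else "plong"
    if solr6_type == "string" and multi_valued:
        return "strings"
    # searchText and single-valued string keep the same name in both versions.
    return solr6_type


# (pattern, Solr 6 type, multiValued) as in the Solr 6 command.
SOLR6_DYNAMIC_FIELDS = [
    ("*_section", "searchText", False),
    ("*_id", "long", False),
    ("*_ids", "long", True),
    ("*_system", "string", False),
    ("*_attr", "string", False),
    ("*_attrs", "string", True),
]

for pattern, type_name, multi in SOLR6_DYNAMIC_FIELDS:
    print(pattern, type_name, "->", solr7_type(type_name, multi))
```
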
Deleting documents
These commands will permanently delete your documents; use with caution.
Delete documents based on a custom query:
curl "http://localhost:8983/solr/report_core/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>source:"My Source"</query></delete>'
Delete all documents:
curl "http://localhost:8983/solr/report_core/update?commit=true" -H "Content-Type: text/xml" --data-binary '<delete><query>*:*</query></delete>'
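When deletes are scripted rather than typed by hand, a small guard against the match-all query can be useful. The following Python sketch builds the same XML bodies as the curl commands above; delete_body, delete_request, and the allow_all flag are hypothetical names, and the request is constructed but not sent.

```python
import urllib.request
from xml.sax.saxutils import escape

UPDATE_URL = "http://localhost:8983/solr/report_core/update?commit=true"


def delete_body(query: str) -> str:
    """Build the XML delete-by-query body, escaping &, < and > in the query."""
    return f"<delete><query>{escape(query)}</query></delete>"


def delete_request(query: str, allow_all: bool = False) -> urllib.request.Request:
    """Build (but do not send) a delete-by-query request.

    Refuses the match-all query unless allow_all is set, as a guard
    against wiping the core by accident.
    """
    if query.strip() == "*:*" and not allow_all:
        raise ValueError("refusing to delete all documents without allow_all=True")
    return urllib.request.Request(
        UPDATE_URL,
        data=delete_body(query).encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


print(delete_body('source:"My Source"'))
```

Passing the result of delete_request to urllib.request.urlopen would perform the delete and commit it, so the same caution applies as for the curl commands.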