Finding Date Expressions¶
Overview¶
ClarityNLP includes a module that locates date expressions in clinical text.
By ‘date expression’ we mean a string such as July 20, 1969
, 7.20.69
,
or something similar. The DateFinder
module scans sentences for date
expressions, extracts them, and generates output in JSON format.
Source Code¶
The source code for the date finder module is located in
nlp/algorithms/finder/date_finder.py
.
Inputs¶
A single string, the sentence to be scanned for date expressions.
Outputs¶
A JSON array containing these fields for each date expression found:
Field Name | Explanation |
---|---|
text | string, text of the complete date expression |
start | integer, offset of the first character in the matching text |
end | integer, offset of the final character in the matching text plus 1 |
year | integer year |
month | integer month (Jan=1, Feb=2, …, Dec=12) |
day | integer day of the month [1, 31] |
All JSON results contain an identical number of fields. Any fields that are not valid for a given date expression will have a value of EMPTY_FIELD and should be ignored.
Algorithm¶
ClarityNLP uses a set of regular expressions to recognize date expressions.
The date_finder module scans a sentence with each date-finding regex and
keeps track of any matches. If any matches overlap, an overlap resolution
process is used to select a winniner. Each winning match is converted to a
DateValue
namedtuple. This object is defined at the top of the source code
module and can be imported by other Python code. Each namedtuple is appended
to a list as the sentence is scanned. After scanning completes, the list of
DateValue
namedtuples is converted to JSON and returned to the caller.
Date Expression Formats¶
Using notation similar to that used by the PHP date reference, we define the following quantities:
Shorthand | Meaning |
---|---|
dd | one or two-digit day of the month with optional suffix (7th, 22nd, etc.) |
DD | two-digit day of the month |
m | textual name of the month |
M | textual month abbreviation |
mm | one or two-digit numerical month |
MM | two-digit month |
y | two or four-digit year |
yy | two-digit year |
YYYY | four-digit year |
? | optional |
With these definitions, the date expression formats that ClarityNLP recognizes are (using the date of the first Moon landing for illustration):
Date Expression Format | Examples |
---|---|
YYYYMMDD | 19690720 |
[-+]?YYYY-MM-DD | +1969-07-20 |
YYYY/MM/DD | 1969/07/20 |
YY-MM-DD | 69-07-20 |
YYYY-MM-DDTHH:MM:SS(.ffffff)? | |
(here MM:SS means minutes and seconds) | |
mm/dd/YYYY | 07/20/1969 |
YYYY/mm/dd | 1969/7/20, 1969/07/20 |
dd-mm-YYYY, dd.mm.YYYY | 20-07-1969, 20.7.1969 |
y-mm-dd | 1969-7-20, 1969-07-20, 69-7-20 |
dd.mm.yy | 20.7.69, 20.07.69 |
dd-m-y, ddmy, dd m y | 20-JULY-69, 20JULY69, 20 July 1969 |
m-dd-y, m.dd.y, mddy, m dd, y | 20-July 1969, 20JULY1969, 20 July, 1969 |
M-DD-y | Jul-20-1969, Jul-20-69 |
y-M-DD | 69-Jul-20, 1969-Jul-20 |
mm/dd | 7/20, 07/20 |
m-dd, m.dd, m dd | July 20, July 20th, July-20 |
dd-m, dd.m, dd m | 20-July, 20.July, 20 July |
YYYY-mm | 1969-07, 1969-7 |
m-YYYY, m.YYYY, m YYYY | July-1969, July.1969, July 1969 |
YYYY-m, YYYY.m, YYYY m | 1969-July, 1969.July, 1969 July |
YYYY | 1969 |
m | July |