Technologies We Use

ClarityNLP is built on several popular open-source projects. In this section we provide a brief overview of each project and describe how it is used by ClarityNLP.

Docker

Docker uses operating-system-level virtualization to provide a means of isolating applications from each other and controlling their access to system resources. Isolated applications run in restricted environments called containers. A container includes the application and all dependencies so that it can be deployed as a self-contained unit.

ClarityNLP can be deployed as a set of Docker containers. The secure OAuth2-based server configuration assumes this deployment mechanism. You can find out more about the ClarityNLP setup options and our use of Docker in our setup documentation.

Solr

Apache Solr is an enterprise search platform with many advanced features including fault tolerance, distributed indexing, and the ability to scale to billions of documents. It is fast, highly configurable, and supports a wide range of user customizations.

ClarityNLP uses Solr as its primary document store. Any documents that ClarityNLP processes must be retrieved from Solr. We provide instructions on how to ingest documents into Solr, along with some Python scripts to help with common data sets. See our document ingestion documentation for more information.
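
As a rough illustration of what retrieving documents from Solr involves, the sketch below builds a request URL for Solr's standard /select query handler using only the Python standard library. The host, the core name "sample", and the field name "report_text" are assumptions made for this example and are not taken from ClarityNLP's configuration. The URL is constructed but not sent.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, start=0):
    """Build a URL for Solr's /select query handler.

    base_url and core are hypothetical values supplied by the caller.
    """
    params = {
        "q": query,      # Solr query string, e.g. field:value
        "rows": rows,    # maximum number of documents to return
        "start": start,  # offset of the first document, for paging
        "wt": "json",    # ask Solr for a JSON response
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "sample", "report_text:pneumonia", rows=5
)
print(url)
```

The "rows" and "start" parameters allow a large result set to be fetched in pages rather than all at once.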

PostgreSQL

PostgreSQL is one of the leading open-source relational database systems, distinguished by its robust feature set, ACID compliance, and excellent performance. ClarityNLP uses Postgres to store data required to manage each NLPQL job. Postgres is also used to store a large amount of medical vocabulary and concept data.

MongoDB

MongoDB is a popular NoSQL document store. A MongoDB document is a JSON object with user-defined fields and values. There is no rigid structure imposed on documents. Multiple documents form groups called collections, and one or more collections comprise a database.

ClarityNLP uses MongoDB to store the results that it finds. The ClarityNLP built-in and custom tasks all define result documents with fields meaningful to each task. ClarityNLP augments the result documents with additional job-specific fields and stores everything in a single collection.

ClarityNLP also evaluates NLPQL expressions by translating them into a MongoDB aggregation pipeline.
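
To make the ideas of a result document and an aggregation pipeline more concrete, here is a minimal sketch in plain Python. The field names ("job_id", "nlpql_feature", "subject", "value") and the pipeline itself are assumptions chosen for illustration; this is not the pipeline that ClarityNLP generates from an NLPQL expression. The "$match" and "$group" stages are standard MongoDB aggregation operators. The pipeline is only constructed here, not executed against a database.

```python
import json

# A hypothetical result document: "value" is a task-specific field, while
# "job_id" and "nlpql_feature" stand in for job-specific fields.
sample_result = {
    "job_id": 42,
    "nlpql_feature": "hasFever",
    "subject": "patient-1",
    "value": 101.2,
}


def build_pipeline(job_id, feature, threshold):
    """Return an aggregation pipeline as a list of stage documents."""
    return [
        # Stage 1: keep documents for this job and feature whose value
        # is at or above the threshold.
        {
            "$match": {
                "job_id": job_id,
                "nlpql_feature": feature,
                "value": {"$gte": threshold},
            }
        },
        # Stage 2: count the matching documents for each subject.
        {"$group": {"_id": "$subject", "count": {"$sum": 1}}},
    ]


pipeline = build_pipeline(42, "hasFever", 100.4)
print(json.dumps(pipeline, indent=2))
```

With a driver such as pymongo, a pipeline of this shape would be passed to a collection's aggregate() method.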

NLP Libraries (spaCy, textacy, nltk)

The natural language processing libraries spaCy and nltk provide implementations of the fundamental NLP algorithms that ClarityNLP needs. These algorithms include sentence segmentation, part-of-speech tagging, and dependency parsing, among others. ClarityNLP builds its NLP algorithms on top of the foundation provided by spaCy and nltk.

Textacy is a higher-level NLP library built on spaCy. ClarityNLP uses textacy for its Clarity.ngram task and for computing text statistics with Clarity.TextStats.

Luigi

Luigi is a Python library that manages and schedules pipelines of batch processes. A pipeline is an ordered sequence of tasks needed to compute a result. The tasks in the pipeline can have dependencies, which are child tasks that must run and finish before the parents can be scheduled to run. Luigi handles the task scheduling, dependency management, restart-on-failure, and other necessary aspects of managing these pipelines.

The NLPQL Reference defines a set of core and custom tasks that comprise the data processing capabilities of ClarityNLP. ClarityNLP uses Luigi to schedule and manage the execution of these tasks.
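
The dependency ordering described above can be sketched with the Python standard library. The example below does not use Luigi's API (in Luigi, a task declares its dependencies in its requires() method); it only shows how a set of task dependencies determines a valid execution order. The task names are hypothetical, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "report": {"extract_terms", "segment_sentences"},
    "extract_terms": {"segment_sentences"},
    "segment_sentences": {"load_documents"},
    "load_documents": set(),
}


def run_order(deps):
    """Return the tasks in an order where dependencies come first."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Every task appears after all of the tasks it depends on, which is the guarantee a scheduler must provide before running a parent task.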

Flask

Flask is a “micro” framework for building Web applications. Flask provides a web server and a minimal set of core features, as well as an extension mechanism for including features found in more comprehensive Web frameworks.

The ClarityNLP component that provides the NLP Web APIs is built with Flask.

Redis

Redis is an in-memory key-value store that is typically used as a fast cache for frequently accessed data. The value mapped to each key can be either a string or a more complex data structure. Redis supports many advanced features such as partitioning and time-based key expiration.

ClarityNLP uses Redis as a fast query cache.
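
The following sketch illustrates the idea of a key-value cache with time-based key expiration using only the Python standard library. It is a stand-in for the concept, not Redis itself and not ClarityNLP code; the class name ExpiringCache and the key "query:fever" are hypothetical. A fake clock is injected so that the example behaves deterministically.

```python
import time


class ExpiringCache:
    """A tiny in-memory key-value store with per-key expiration."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expiry time)

    def set(self, key, value, ttl_seconds):
        """Store value under key for ttl_seconds."""
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        """Return the cached value, or default if missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            # The entry has expired: drop it and report a miss.
            del self._items[key]
            return default
        return value


# Use a fake clock so the example does not depend on real time.
now = [0.0]
cache = ExpiringCache(clock=lambda: now[0])
cache.set("query:fever", ["doc-1", "doc-7"], ttl_seconds=30)
hit = cache.get("query:fever")   # still cached
now[0] = 31.0                    # advance the clock past the expiry time
miss = cache.get("query:fever")  # expired, so the default is returned
print(hit, miss)
```

Redis provides comparable behavior natively, for example through the EXPIRE command or the EX option of SET, so a client does not need to track expiry times itself.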

Pandas

Pandas is a Python library for data analysis, with particular strengths in manipulating tabular and labeled data. It provides data structures and methods for the kinds of operations that are typically performed in a spreadsheet. It also provides a powerful I/O library and integrates fully with the Python machine learning, data analysis, and visualization stack.

ClarityNLP uses pandas for some I/O operations and for various forms of data manipulation.

Client-side Libraries (React, Sails)

TBD