Introduction¶

Physician reviews are a crucial source of information for patients researching healthcare providers online. However, the vast number of reviews available across multiple platforms can be overwhelming. Allowing patients to more easily surface relevant information quickly would offer a much better search experience.

A prior analysis of three million physician reviews revealed that their content is generally constrained to six predominant themes:

Bedside manner: Refers to the physician's tone and demeanor during visits.
Communication: The physician's clarity and effectiveness in communication.
Quality of care: The patient's perception of the quality of care they received.
Office staff: The support staff managing patient intake and appointment scheduling.
Quality time: The patient's perception that the visit was valuable and worthwhile.
Wait time: The time between patient arrival and the start of the visit.

Approximately 95% of all reviews reference at least one of these themes, with "quality of care" being the most prevalent (present in 80% of reviews). Over 62% of reviews contain more than one theme, with one or two being the most common.

A classification model will be built to identify these themes in new reviews, helping to extract and surface the information most crucial to patients. However, to build such a model, labeled training data are required. To expedite the process, data were labeled using Amazon's crowd-sourcing platform, Mechanical Turk. Each review was evaluated by three participants, who assigned up to six theme labels (see figure below). A label was finalized only when two or more participants agreed.

Let's get started by importing some useful libraries and reading in the data.

%config InlineBackend.figure_format = 'retina'

from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import gensim.downloader

from matplotlib.pyplot import *
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 500)

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import GlobalMaxPooling1D
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.layers import Bidirectional, GRU
from tensorflow.keras.metrics import Precision, Recall, AUC
from tensorflow.keras.models import Sequential

We can start by defining a function to remove punctuation and special characters from the text. This will serve to standardize the reviews and will improve model performance later on.

def process_text(text):
    result = text.lower().strip()
    result = re.sub(r'\n', ' ', result)
    result = re.sub(r'\t', ' ', result)
    result = re.sub(r'\r', ' ', result)
    result = re.sub(r'[^\w\s\']', ' ', result)
    result = re.sub(r'[^\x00-\x7F]+', ' ', result)
    result = re.sub(r' +', ' ', result).strip()
    return result

The data reside in eight separate csv files. Let's read each of them in and redefine some column names.

file_id_list = ['3741148', '3741760', '3742390', '3743847', '3745116', '3746169', '3746609', '3749005']

df = pd.DataFrame()

for i in file_id_list:

    df = pd.concat([df, pd.read_csv(f'data/batch_{i}_results.csv')], axis=0)

df.columns = [i.split('Input.')[-1].split('Answer.')[-1].lower() for i in df.columns]

df = df[df['assignmentstatus'] != 'Rejected'].sort_values('hitid', ascending=False).reset_index(drop=True)

There were a few cases where only two Mechanical Turk participants labeled a particular review. These will be discarded from the analysis.

discard_reviews = df.groupby('hitid')['hitid'].count()[lambda x: x < 3].index.tolist()

df = df[~df['hitid'].isin(discard_reviews)].reset_index(drop=True)

Define the Target Columns¶

target_columns = ['bedside_manner', 'communication', 'quality_of_care', 'office_staff', 'quality_time', 'wait_time']

df[target_columns] = df[target_columns].replace({-1: 1, 0: 1})

df[target_columns] = df[target_columns].replace({-2: 0, 2: 0})

tmp = df.groupby('usersurveyid')[target_columns].sum().reset_index()

tmp[target_columns] = tmp[target_columns].replace({1: 0, 2: 1, 3: 1})

data = df[['usersurveyid', 'text']].merge(tmp, on='usersurveyid', how='inner').drop_duplicates('usersurveyid').reset_index(drop=True)

data[target_columns].apply(lambda x: x.value_counts())

	bedside_manner	communication	quality_of_care	office_staff	quality_time	wait_time
0	13622	15804	4596	19068	23364	23357
1	10893	8711	19919	5447	1151	1158

The data are very imbalanced across several of the targets. For example only about 4.5% of all reviews touch on the "quality time" or "wait time" themes.

Let's view the first few rows of the data.

data.head()

	usersurveyid	text	bedside_manner	communication	quality_of_care
0	16662319	Love her. Griffin has been seeing her since h...	0	0	0
1	16671318	He is an amazing, reliable and intelligent phy...	1	0	1
2	16670938	Melinda is an AMAZING therapist! She is caring...	1	0	1
3	16673989	Nice dentist that really cares about you as a ...	1	0	1
4	16556026	Dr. Kury Perez is very nice and friendly. She ...	1	1	0

Below are two example reviews - the first has no assigned themes while the second has two ("bedside manner" and "communication").

print(data.iloc[0]['text'])

Love her.  Griffin has been seeing her since he was little

print(data.iloc[4]['text'])

Dr. Kury Perez is very nice and friendly. She knows how to explain your health concerns in a simple way. I highly recommend her as a good healthcare provider.

Prepare the Text¶

Let's now apply the pre-processing function to clean the text.

data = data.dropna(subset=['text']).reset_index(drop=True)

data['processed_text'] = data['text'].apply(process_text)

The data are split into training, validation, and test sets with approximately 75% of the data going to the training set.

x_tr, x_te, y_tr, y_te = train_test_split(data['processed_text'], data[target_columns], test_size=3000, random_state=0)

x_tr, x_va, y_tr, y_va = train_test_split(x_tr, y_tr, test_size=3000, random_state=0)

Build the Model¶

A neural network will be built which incorporates pre-trained word vectors from a popular model called Word2Vec. Word2Vec maps words to a dense vector space such that synonyms are found near each other. The variables below allow us to specify how many tokens to use from each review to classify them (maxlen), the total number of tokens in the model vocabulary (max_features), and the length of the Word2Vec embedding vectors (embedding_size). We'll tokenize the reviews and pad them to the same length.

maxlen = 100
max_features = 20000
embedding_size = 300

tokenizer = Tokenizer(num_words=max_features)

tokenizer.fit_on_texts(x_tr)

x_tr = tokenizer.texts_to_sequences(x_tr)
x_te = tokenizer.texts_to_sequences(x_te)
x_va = tokenizer.texts_to_sequences(x_va)

x_tr = pad_sequences(x_tr, maxlen=maxlen)
x_te = pad_sequences(x_te, maxlen=maxlen)
x_va = pad_sequences(x_va, maxlen=maxlen)

The gensim package has a nice way to download the necessary word vectors. In addition to Word2Vec, GloVe and fasttext word embeddings are also available.

[i for i in gensim.downloader.info()['models'].keys()]

['fasttext-wiki-news-subwords-300',
 'conceptnet-numberbatch-17-06-300',
 'word2vec-ruscorpora-300',
 'word2vec-google-news-300',
 'glove-wiki-gigaword-50',
 'glove-wiki-gigaword-100',
 'glove-wiki-gigaword-200',
 'glove-wiki-gigaword-300',
 'glove-twitter-25',
 'glove-twitter-50',
 'glove-twitter-100',
 'glove-twitter-200',
 '__testing_word2vec-matrix-synopsis']

Each word in the review vocabulary can now be embedded.

embedding_index = gensim.downloader.load('word2vec-google-news-300')

word_index = tokenizer.word_index

num_words = min(max_features, len(word_index)) + 1

embedding_matrix = np.zeros((num_words, embedding_size))

for word, i in word_index.items():
    if i >= max_features:
        break

    try:
        embedding_matrix[i] = embedding_index[word]
    except:
        pass

Let's look at the first few embedding values for "doctor".

print(embedding_index['doctor'][:10])

[-0.09326172  0.02734375  0.07958984 -0.01287842 -0.11181641  0.3359375
 -0.13867188 -0.12011719  0.02685547 -0.20410156]

The model will be a bidirectional recurrent neural network built using Keras. The Adam optimizer will be used with binary crossentropy as the loss. The area under the receiver operating characteristic and precision-recall curves will be printed after each epoch. Prior experimentation suggests three epochs is sufficient to optimize model performance.

model = Sequential([

    Embedding(num_words, embedding_size, weights=[embedding_matrix]),

    SpatialDropout1D(0.1),
    Bidirectional(GRU(80, return_sequences=True)),
    GlobalMaxPooling1D(),
    Dense(6, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[AUC(curve='ROC'), AUC(curve='PR')])

history = model.fit(x_tr, y_tr, batch_size=128, epochs=3, validation_data=(x_va, y_va))

Epoch 1/3
145/145 [==============================] - 43s 275ms/step - loss: 0.3532 - auc_2: 0.9109 - auc_3: 0.8399 - val_loss: 0.2552 - val_auc_2: 0.9534 - val_auc_3: 0.9070
Epoch 2/3
145/145 [==============================] - 41s 281ms/step - loss: 0.2277 - auc_2: 0.9641 - auc_3: 0.9260 - val_loss: 0.2292 - val_auc_2: 0.9623 - val_auc_3: 0.9202
Epoch 3/3
145/145 [==============================] - 41s 284ms/step - loss: 0.2007 - auc_2: 0.9719 - auc_3: 0.9402 - val_loss: 0.2313 - val_auc_2: 0.9621 - val_auc_3: 0.9187

The callback suggests that the model produces an ROC AUC of about 0.92 averaged across the six classes. Let's apply the model to our test set and see how we did.

predict = model.predict(x_va)

94/94 [==============================] - 2s 17ms/step

print(classification_report(y_va, predict.round(), zero_division=0))

              precision    recall  f1-score   support

           0       0.79      0.84      0.81      1319
           1       0.82      0.86      0.84      1037
           2       0.85      0.97      0.90      2442
           3       0.93      0.89      0.91       655
           4       0.61      0.14      0.23       144
           5       0.79      0.56      0.66       137

   micro avg       0.83      0.88      0.86      5734
   macro avg       0.80      0.71      0.72      5734
weighted avg       0.83      0.88      0.85      5734
 samples avg       0.83      0.86      0.83      5734

fig, axs = subplots(2, 3, figsize=(12, 7))

fig.subplots_adjust(wspace=0.2, hspace=0.3)

for i, ax in enumerate(fig.axes):

    fpr, tpr, threshold = roc_curve(y_va.iloc[:, i], predict[:, i])

    score = roc_auc_score(y_va.iloc[:, i], predict[:, i])

    ax.plot(fpr, tpr, lw=1)

    ax.plot([0, 1], [0, 1], '--k', lw=1)

    ax.minorticks_on()
    ax.set_ylim(0, 1)
    ax.set_xlim(0, 1)
    ax.set_ylabel('True Positive Rate', fontsize=8)
    ax.set_xlabel('False Positive Rate', fontsize=8)
    ax.tick_params(axis='both', labelsize=8)
    ax.set_title(f'{target_columns[i]}: {score.round(3)}', fontsize=8)

The model reaches an average ROC AUC score of 0.92 across the six themes. Of all the themes, the model has the hardest time understanding when a review references the quality of care received. This is also reflected in the labels given by the Mechanical Turk participants, likely because of the many ways in which the theme can be described. For example, the physician's knowledge or patient recovery time, could pertain to quality of care, and there may be overlap between this theme and "bedside manner" or "communication". Perhaps it would be helpful to break this theme into multiple ones or to combine themes.