Introduction¶
Numerous platforms exist today for patients to review and compare healthcare providers, allowing them to make more informed decisions about their medical care. While the majority of these reviews are positive in tone, a sizable proportion are negative. Negative reviews sometimes contain content that violates platform policy, necessitating their removal.
Several commercial sentiment models can be used to help moderate submitted reviews. Two popular choices are Alchemy, part of the IBM Watson suite of tools, and Indico. Once flagged by these tools, negative reviews can be escalated to human moderators for further evaluation. Alchemy works reasonably well but is expensive, carrying a large yearly subscription fee. Additionally, its relatively high proportion of false positives creates unnecessary work for the moderators. Indico performs better than Alchemy on healthcare reviews, but is also very costly.
A model that performs as well as Indico can be built using existing data sources. This has the benefit of reducing the false positive rate while eliminating subscription expenses. The analysis below compares the performance of this new model against Alchemy and Indico.
Let's get started by importing some useful libraries and reading in the data.
%config InlineBackend.figure_format = 'retina'
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import Pipeline
from matplotlib.pyplot import *
import seaborn as sns
import pandas as pd
import numpy as np
import json
import re
pd.set_option('display.max_columns', 500)
The functions below are used to clean the reviews. Punctuation and special characters are removed, and some common abbreviations are replaced to standardize the text.
def process_text(text):
    """Lowercase the text and remove line breaks, non-ASCII characters, and punctuation."""
    result = text.lower()
    result = re.sub(r'\n', ' ', result)             # line feeds
    result = re.sub(r'\r', ' ', result)             # carriage returns
    result = re.sub(r'[^\x00-\x7F]+', ' ', result)  # non-ASCII characters
    result = re.sub(r'[\W]+', ' ', result)          # punctuation and special characters
    return result.strip()
def replace_abbreviations(text):
    """Expand common abbreviations found in provider reviews."""
    result = text.lower()
    result = re.sub(r'\byrs\b', 'years', result)
    result = re.sub(r'\bhr\b', 'hours', result)
    result = re.sub(r'\bhrs\b', 'hours', result)
    result = re.sub(r'\bmin\b', 'minutes', result)
    result = re.sub(r'\bmins\b', 'minutes', result)
    result = re.sub(r'\bdr\b', 'doctor', result)
    result = re.sub(r'\bdoc\b', 'doctor', result)
    result = re.sub(r'\bapt\b', 'appointment', result)
    result = re.sub(r'\bappt\b', 'appointment', result)
    result = re.sub(r' +', ' ', result).strip()     # collapse repeated whitespace
    return result
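As a quick sanity check, the two helpers can be chained on a short, made-up review (the example text below is purely illustrative):
example = "Dr. Smith was great!\nWaited 45 mins for my appt, but worth it."
print(replace_abbreviations(process_text(example)))
# -> doctor smith was great waited 45 minutes for my appointment but worth it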
Let's load in the review data.
Read in Provider Reviews¶
provider_reviews = pd.read_csv('./data/provider_reviews.csv', encoding='ISO-8859-1')
provider_reviews.rename(columns={'review-text-cleaned': 'review_text'}, inplace=True)
provider_reviews = provider_reviews[provider_reviews['review_text'] != ''].reset_index(drop=True)
# Recode ratings so that 0 = positive sentiment and 1 = negative sentiment
provider_reviews['rating'] = provider_reviews['rating'].replace({1: 0, -1: 1})
ax = provider_reviews['rating'].value_counts(normalize=True).plot.bar(width=0.8, rot=0, figsize=(5, 4))
ax.minorticks_on()
ax.yaxis.set_tick_params(which='minor', bottom=True)
ax.xaxis.set_tick_params(which='minor', bottom=False)
ax.set_ylim(0, 1)
ax.set_ylabel('Fraction of Reviews', fontsize=8)
ax.set_xticklabels(['Positive', 'Negative'], fontsize=8);
ax.tick_params(axis='y', labelsize=8)
The data are imbalanced: approximately 70% of reviews are labeled as having positive sentiment and the remaining 30% as having negative sentiment. The data can be balanced by adding a sample of negative reviews previously rejected for violating platform policy.
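The exact split can be printed directly from the recoded labels:
# Roughly 70% of reviews should be positive (0) and 30% negative (1)
print(provider_reviews['rating'].value_counts(normalize=True))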
Read in Rejected Reviews¶
rejected_reviews = pd.read_csv('./data/rejected_reviews.csv', encoding='ISO-8859-1')

# Keep only reviews rejected for policy violations with strongly negative sentiment scores
rejected_reviews = rejected_reviews[(rejected_reviews['RejectionReason'] == 'Violates Site Policy') & (rejected_reviews['SentimentScore'] <= -0.5)]
rejected_reviews = rejected_reviews[rejected_reviews['CommentText'] != ''].dropna(subset=['CommentText']).reset_index(drop=True)
rejected_reviews.rename(columns={'CommentText': 'review_text'}, inplace=True)

# Rejected reviews are all labeled as negative sentiment (class 1)
rejected_reviews['rating'] = 1
The two datasets can be combined to create one final training dataset.
tmp = pd.concat([provider_reviews[['review_text', 'rating']], rejected_reviews[['review_text', 'rating']]], axis=0).reset_index(drop=True)

# Downsample the positive class so it matches the number of negative reviews
negative = tmp[tmp['rating'] == 1]
positive = tmp[tmp['rating'] == 0].sample(n=negative.shape[0], random_state=0)
training_dataset = pd.concat([positive, negative], axis=0).reset_index(drop=True)
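A quick check confirms that the downsampling leaves the two classes evenly split:
# Both classes should now make up 50% of the training data
print(training_dataset['rating'].value_counts(normalize=True))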
Lastly, a test dataset is read in to provide an unbiased estimate of performance for all three models.
Read in Test Data¶
test_dataset = pd.read_csv('./data/hgdata.csv', encoding='ISO-8859-1')
test_dataset.rename(columns={'comment_text': 'review_text',
                             'AlchemyBinary': 'alchemy_rating',
                             'IndicoBinary': 'indico_rating',
                             'MturkBinarySent': 'mturk_rating'}, inplace=True)
fig, axs = subplots(1, 2, figsize=(10, 4))
(-1 * test_dataset['AlchemyScore']).plot.hist(density=True, title='Alchemy', ax=axs[0], fontsize=8, bins=50)
(1 - test_dataset['IndicoScore']).plot.hist(density=True, title='Indico', ax=axs[1], fontsize=8, bins=50)
axs[0].minorticks_on()
axs[0].set_title('Alchemy', fontsize=8)
axs[0].set_ylabel('Frequency', fontsize=8)
axs[0].set_xlabel('Sentiment Score', fontsize=8)
axs[1].minorticks_on()
axs[1].set_title('Indico', fontsize=8)
axs[1].set_ylabel('Frequency', fontsize=8)
axs[1].set_xlabel('Sentiment Score', fontsize=8);
The figures above show the Alchemy and Indico sentiment scores on the test dataset. Alchemy returns a wide distribution of values, with many scores concentrated near 0. Indico, on the other hand, is much more certain in its classifications, with most scores near 0 or 1.
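The same comparison can be made numerically. The short sketch below summarizes the rescaled scores used in the plots (the column names and rescaling follow the plotting code above):
# Summary statistics for the rescaled Alchemy and Indico scores
score_summary = pd.DataFrame({
    'Alchemy': -1 * test_dataset['AlchemyScore'],
    'Indico': 1 - test_dataset['IndicoScore'],
}).describe()
print(score_summary)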
The reviews can now be cleaned and standardized.
training_dataset['review_text'] = training_dataset['review_text'].apply(lambda x: replace_abbreviations(process_text(x)))
test_dataset['review_text'] = test_dataset['review_text'].apply(lambda x: replace_abbreviations(process_text(x)))
Build the Model¶
A logistic regression model performs well for this task and produces well-calibrated probability estimates. A pipeline object is created which vectorizes the text using term frequency-inverse document frequency (TF-IDF) and passes the result to the model. A grid search is then performed over a set of TF-IDF hyperparameters to find the best-performing combination.
param_grid = {
    'vec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'vec__stop_words': ['english', None],
    'vec__use_idf': [True, False],
    'vec__norm': ['l1', 'l2', None],
    'vec__sublinear_tf': [True, False],
    'vec__max_features': [5000, 10000, 15000, 30000]
}

pipeline = Pipeline([
    ('vec', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear', max_iter=200))
])
grid_cv = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', cv=10, n_jobs=5)
grid_cv.fit(training_dataset['review_text'], training_dataset['rating']);
- Best TF-IDF Hyperparameters
  - max_features: 15,000
  - ngram_range: (1, 2)
  - norm: 'l2'
  - stop_words: None
  - sublinear_tf: True
  - use_idf: True
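The values listed above can be read directly from the fitted search object rather than transcribed by hand:
# Best TF-IDF settings selected by the grid search
print(grid_cv.best_params_)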
The regularization parameter for the logistic regression model was not included in the previous hyperparameter search because of time and resource constraints. A second grid search can be performed over this parameter now.
param_grid = {'clf__C': np.logspace(-1, 1, 20)}
model = GridSearchCV(estimator=grid_cv.best_estimator_, param_grid=param_grid, scoring='roc_auc', cv=10, n_jobs=5)
model.fit(training_dataset['review_text'], training_dataset['rating']);
- Best Model Parameter
  - C: 3.793
Cross-validation suggests the best model achieves an ROC AUC score of 0.98.
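The quoted score comes straight from the fitted search object:
# Mean cross-validated ROC AUC of the selected model
print(round(model.best_score_, 2))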
Apply the Model to Test Data¶
The trained model can now be evaluated by applying it to the test data. The results are also compared to the Alchemy and Indico models.
test_dataset['predict'] = model.best_estimator_.predict(test_dataset['review_text'])
test_dataset['probability'] = model.best_estimator_.predict_proba(test_dataset['review_text'])[:, 1]
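Before looking at confusion matrices, a threshold-free comparison is possible. The sketch below computes the new model's ROC AUC on the test set, assuming mturk_rating encodes the ground-truth label (1 = negative sentiment):
from sklearn.metrics import roc_auc_score

# ROC AUC of the new model's predicted probabilities against the human-assigned (MTurk) labels
test_auc = roc_auc_score(test_dataset['mturk_rating'], test_dataset['probability'])
print(round(test_auc, 3))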
# Confusion matrices, transposed so rows correspond to the predicted label and columns to the true label
cm_alc = confusion_matrix(test_dataset['mturk_rating'], test_dataset['alchemy_rating']).T
cm_ind = confusion_matrix(test_dataset['mturk_rating'], test_dataset['indico_rating']).T
cm_mod = confusion_matrix(test_dataset['mturk_rating'], test_dataset['predict']).T

# Normalize each column so the entries are rates per true class
cm_alc = cm_alc.astype('float')/cm_alc.sum(axis=0)
cm_ind = cm_ind.astype('float')/cm_ind.sum(axis=0)
cm_mod = cm_mod.astype('float')/cm_mod.sum(axis=0)
fig, axs = subplots(nrows=1, ncols=3, figsize=(16, 4))
classes = ['+ sentiment', '- sentiment']
tick_marks = np.arange(len(classes))
sns.heatmap(cm_alc, annot=True, cmap='Reds', vmin=0, vmax=1, ax=axs[0])
axs[0].set_xlabel('True Label', fontsize=8)
axs[0].set_ylabel('Predicted Label', fontsize=8)
axs[0].set_xticks(tick_marks + 0.5, classes, fontsize=8)
axs[0].set_yticks(tick_marks + 0.5, classes, fontsize=8)
axs[0].set_title('Alchemy', fontsize=8)
axs[0].figure.axes[-1].tick_params(labelsize=8)
sns.heatmap(cm_ind, annot=True, cmap='Reds', vmin=0, vmax=1, ax=axs[1])
axs[1].set_xlabel('True Label', fontsize=8)
axs[1].set_ylabel('Predicted Label', fontsize=8)
axs[1].set_xticks(tick_marks + 0.5, classes, fontsize=8)
axs[1].set_yticks(tick_marks + 0.5, classes, fontsize=8)
axs[1].set_title('Indico', fontsize=8)
axs[1].figure.axes[-1].tick_params(labelsize=8)
sns.heatmap(cm_mod, annot=True, cmap='Reds', vmin=0, vmax=1, ax=axs[2])
axs[2].set_xlabel('True Label', fontsize=8)
axs[2].set_ylabel('Predicted Label', fontsize=8)
axs[2].set_xticks(tick_marks + 0.5, classes, fontsize=8)
axs[2].set_yticks(tick_marks + 0.5, classes, fontsize=8)
axs[2].set_title('New Model', fontsize=8)
axs[2].figure.axes[-1].tick_params(labelsize=8)
The figures above show the confusion matrices for the three models: Alchemy (left), Indico (middle), and the new model (right). The $x$-axis of each figure represents the actual classification label, while the $y$-axis represents the model's prediction. Alchemy correctly identifies negative reviews 96% of the time and positive reviews 86% of the time. However, its relatively high false positive rate of 14% leads to large numbers of reviews being sent for human moderation.
In contrast, the Indico model has a slightly higher false negative rate but a significantly lower false positive rate, making it the stronger model overall. Still, Indico's improved performance comes with a high yearly subscription cost.
Lastly, the new model correctly classifies negative reviews 97% of the time and positive reviews 94% of the time. This is a slight overall improvement over Indico and a roughly 60% reduction in false positives relative to Alchemy.
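The false positive comparison above can be reproduced directly from the predictions; the sketch below assumes mturk_rating encodes the ground truth with 1 for negative sentiment and 0 for positive:
# Fraction of truly positive reviews that each model flags as negative
positives = test_dataset[test_dataset['mturk_rating'] == 0]
for name, col in [('Alchemy', 'alchemy_rating'), ('Indico', 'indico_rating'), ('New Model', 'predict')]:
    fpr = (positives[col] == 1).mean()
    print(f'{name}: false positive rate = {fpr:.2%}')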
fig, axs = subplots(1, 3, figsize=(15, 4))
(-1 * test_dataset['AlchemyScore']).plot.hist(density=True, ax=axs[0], fontsize=8, bins=50)
(1 - test_dataset['IndicoScore']).plot.hist(density=True, ax=axs[1], fontsize=8, bins=50)
test_dataset['probability'].plot.hist(density=True, ax=axs[2], fontsize=8, bins=50)
model_names = ['Alchemy', 'Indico', 'New Model']
for ax, name in zip(axs, model_names):
    ax.minorticks_on()
    ax.set_title(name, fontsize=8)
    ax.set_ylabel('Frequency', fontsize=8)
    ax.set_xlabel('Sentiment Score', fontsize=8)
Finally, the figures above compare the classification probabilities from each of the three models. The sentiment distribution produced by the new model closely resembles that of Indico.