In my amateur attempt to investigate medical insurance fraud I decided to look at this 2023 dataset here:
Medicare Part D Prescribers - by Provider
https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider/data
Full Jupyter Notebook Analysis:
https://github.com/BigBrotherPronin/medfraudinvestigation
My thinking was that if I ran an anomaly detection algorithm such as Isolation Forest, it would surface outliers who might be committing fraud. I did find outliers, but unfortunately they turned out not to be guilty (one is still pending further investigation).
Here is what happened:
1. Data Processing
The dataset has over a million rows and 84 columns, so I converted the key columns to numeric types and dropped the rows where any of them were missing.
# Convert key columns, turning errors into 'NaN' (Not a Number)
numeric_cols = ['tot_clms', 'tot_drug_cst', 'tot_benes', ...]
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
# Remove rows with missing essential data
df.dropna(subset=numeric_cols, inplace=True)
After processing the data we got around 414,000 providers with complete columns ready for evaluation.
2. Feature Engineering
Next I engineered a few ratio metrics and ran the Isolation Forest algorithm on them to find outliers.
# Creating primary investigative metrics
df['claims_per_beneficiary'] = df['tot_clms'] / df['tot_benes']
df['cost_per_claim'] = df['tot_drug_cst'] / df['tot_clms']
df['opioid_claim_rate'] = df['opioid_tot_clms'] / df['tot_clms']
df['brnd_tot_clms'] = df['brnd_tot_clms'].fillna(0)
df['brand_claim_rate'] = df['brnd_tot_clms'] / df['tot_clms']
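One thing these ratios can trip over is a zero denominator, which pandas turns into `inf` rather than an error. The sketch below shows one way to guard against that. The `safe_ratio` helper and the toy numbers are hypothetical and not taken from the notebook; only the column names come from the snippets above.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, yielding NaN where the denominator is zero or missing."""
    # Hypothetical helper: zeros in the denominator become NaN before dividing,
    # so the result never contains inf.
    denom = denominator.where(denominator != 0)
    return numerator / denom

# Made-up rows, only to exercise the zero-denominator cases.
toy = pd.DataFrame({
    "tot_clms": [100, 0, 50],
    "tot_benes": [20, 5, 0],
    "tot_drug_cst": [1500.0, 0.0, 900.0],
})

toy["claims_per_beneficiary"] = safe_ratio(toy["tot_clms"], toy["tot_benes"])
toy["cost_per_claim"] = safe_ratio(toy["tot_drug_cst"], toy["tot_clms"])
print(toy)
```

Rows that end up with NaN in a ratio can then be dropped or handled explicitly before they reach the model.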
3. Anomaly Detection Results
After running the Isolation Forest algorithm on these metrics, I had over 12k prescribers flagged as anomalies, and as a group they were clearly different from their peers:
Metric | Anomaly | Normal |
---|---|---|
avg_claims_per_beneficiary | 12.71 | 5.18 |
avg_cost_per_claim | $3,236.40 | $150.59 |
avg_patient_risk_score | 2.21 | 1.55 |
most_common_specialty | Hem-Onc | NP |
The average patient risk score is higher in the anomaly group, which suggests that most of these anomalies are not fraud at all but prescribers who work in fields that prescribe expensive drugs to sicker patients. My next checks bore this out.
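For readers who want to see the general shape of this step, here is a minimal sketch using scikit-learn's `IsolationForest` on synthetic data. It is not the notebook's code: the data is randomly generated, and `contamination=0.03` is an assumption picked only because 12k out of roughly 414k is about 3%.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the engineered metrics (made-up distributions).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "claims_per_beneficiary": rng.normal(5, 1, n),
    "cost_per_claim": rng.normal(150, 30, n),
    "opioid_claim_rate": rng.uniform(0, 0.1, n),
    "brand_claim_rate": rng.uniform(0, 0.3, n),
})

# Plant five obviously extreme rows (labels 0 through 4) so there is something to find.
toy.loc[:4, "claims_per_beneficiary"] = 40.0
toy.loc[:4, "cost_per_claim"] = 50_000.0

features = [
    "claims_per_beneficiary",
    "cost_per_claim",
    "opioid_claim_rate",
    "brand_claim_rate",
]

# contamination is the assumed share of anomalies; fit_predict returns -1 for outliers.
model = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
toy["is_anomaly"] = model.fit_predict(toy[features]) == -1

# Compare group averages, in the same spirit as the table above.
toy["group"] = np.where(toy["is_anomaly"], "anomaly", "normal")
summary = toy.groupby("group")[features].mean()
print(summary)
```

Isolation Forest only ranks rows by how easy they are to isolate, so a flag means "unusual", not "fraudulent". That is why the group comparison and manual checks matter.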
4. False Positives
Among the top anomalies, I looked for prescribers with the highest cost per claim and the lowest patient risk scores, and that is where I thought I had found my smoking gun.
Ngoc Trieu and Ada Noh were optometrists with patient risk scores around 1.00 but extremely high costs per claim: over $100k and over $30k respectively. However, when I cross-referenced their prescriptions in this database:
I found that they were prescribing Oxervate, a legitimately expensive and rare drug. There goes my smoking gun.
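The "high cost, low risk" check described above boils down to a filter and a sort. A minimal sketch follows, with made-up rows and placeholder IDs. The column name `bene_avg_risk_scre` and the 1.1 risk cutoff are assumptions for illustration, not values from the notebook.

```python
import pandas as pd

# Made-up flagged prescribers; the IDs are placeholders, not real NPIs.
flagged = pd.DataFrame({
    "npi": ["A", "B", "C", "D"],
    "cost_per_claim": [120_000.0, 35_000.0, 4_000.0, 90_000.0],
    "bene_avg_risk_scre": [1.00, 1.02, 2.50, 3.10],  # assumed column name
})

# Keep prescribers whose patients look roughly average-risk (assumed cutoff),
# then rank them by cost per claim, most expensive first.
RISK_CUTOFF = 1.1
low_risk = flagged[flagged["bene_avg_risk_scre"] <= RISK_CUTOFF]
shortlist = low_risk.sort_values("cost_per_claim", ascending=False)
print(shortlist)
```

A shortlist like this only says where to look next. As the Oxervate case shows, a single expensive drug can explain the whole number.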
5. The Suspicious Case
I decided to try a new angle. After running Isolation Forest on the opioid prescription rate, I found a prescriber with a 95.6% opioid prescription rate: Dr. Donna Tafor (NPI 1780694190), a pediatrician in Georgia. That is an extremely high opioid prescription rate for anyone, and especially for a pediatrician.
When I checked her prescriptions in the database above, I found she was prescribing Morphine Sulfate (extended release). I looked for an oncology specialty that might justify this but found none, and according to her profile she mostly treats common pediatric problems like infections and sore throats. So this looks weird to me.
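The "especially for a pediatrician" point can be made more concrete by comparing each prescriber with others in the same specialty instead of with everyone. Below is a minimal sketch on made-up numbers. The `prscrbr_type` column name is an assumption about how specialty is stored, and the median-gap score is just one simple choice, not what the notebook does.

```python
import pandas as pd

# Made-up opioid rates for two specialties; none of these are real prescribers.
toy = pd.DataFrame({
    "prscrbr_type": ["Pediatrics"] * 5 + ["Pain Management"] * 3,
    "opioid_claim_rate": [0.00, 0.01, 0.02, 0.01, 0.956, 0.70, 0.80, 0.95],
})

# Median opioid rate within each specialty, broadcast back to every row.
by_specialty = toy.groupby("prscrbr_type")["opioid_claim_rate"]
toy["specialty_median"] = by_specialty.transform("median")

# How far each prescriber sits above the typical peer in the same specialty.
toy["excess_over_median"] = toy["opioid_claim_rate"] - toy["specialty_median"]

top = toy.sort_values("excess_over_median", ascending=False).head(1)
print(top)
```

In this toy data a 95% rate barely stands out among pain management prescribers but is far from the pediatric median, which is the intuition behind treating the specialty as context.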
Dr. Tafor (NPI 1780694190) Summary
Conclusion
I did this exercise for fun. Whether Donna Tafor is guilty of anything is yet to be determined, but the data on her is extremely suspicious.
I will be contacting the Georgia Composite Medical Board and the U.S. Department of Health and Human Services to see what happened here.