Is DeepSeek ready for healthcare? We tested it.
DeepSeek is seen as a threat to OpenAI and other LLMs. Our data on its performance in healthcare begs to differ...
For two weeks now, DeepSeek has haunted the AI world, all of LinkedIn and my stock portfolio. Various predictions were thrown around.
People seemed to agree that DeepSeek had created an extremely performant LLM with remarkably few resources and little time, at first glance matching ChatGPT’s offering. Some complained that DeepSeek had just distilled and reverse-engineered OpenAI’s models - either way, the impact is real.
Naturally, I was wondering what DeepSeek means for healthtech. There’s recent data that ChatGPT performs extremely well in medical cases - can DeepSeek match it? Is it even a threat to specialised, fine-tuned LLMs in healthcare?
We decided to run a litmus test.
To approximate DeepSeek’s performance in healthcare, we tested its DeepThink (R1) against ChatGPT’s o1 on medical case studies. To push both models to their limits and achieve a differentiated outcome, we selected 50 of the most difficult questions from 2024’s medical state exam for German doctors (IMPP’s M2). It’s a multiple-choice format with complex case descriptions, often requiring several chained conclusions. The questions are challenging and sometimes niche - any med student will confirm this - and statistically well tested to differentiate candidates. Further, it allowed us to benchmark against the “best of medical graduates” as an indicator of human performance.
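The newsletter doesn’t say how the models were actually queried (most likely through the chat interfaces), so purely as an illustration, here is a rough sketch of how such a head-to-head could be scripted against the two vendors’ APIs. The model names, prompt wording, and the QUESTIONS/answer key below are placeholders and assumptions, not details from the experiment - the real IMPP questions are not reproduced here.

```python
import os
from openai import OpenAI

# Placeholder data: (case text with options A-E, correct letter).
QUESTIONS = [
    ("A 3-week-old premature infant ... (A) ... (B) ... (C) ... (D) ... (E) ...", "C"),
]

clients = {
    "o1": OpenAI(),  # reads OPENAI_API_KEY from the environment
    "deepseek-reasoner": OpenAI(
        api_key=os.environ["DEEPSEEK_API_KEY"],
        base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
    ),
}

def ask(client: OpenAI, model: str, case: str) -> str:
    """Send one case and return the single answer letter the model picks (naive extraction)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": case + "\n\nAnswer with the single letter of the best option.",
        }],
    )
    return resp.choices[0].message.content.strip()[0].upper()

for model, client in clients.items():
    correct = sum(ask(client, model, case) == key for case, key in QUESTIONS)
    print(f"{model}: {correct / len(QUESTIONS):.0%} accuracy on {len(QUESTIONS)} questions")
```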
The result?
ChatGPT’s o1 outperformed DeepSeek significantly, with 90% vs 74% accuracy (p=0.033). Interestingly, o1 also seems to beat the collective accuracy of med school graduates (82%), who are arguably at the peak of their theoretical knowledge. To be fair, that difference is not statistically significant at our sample size.
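For anyone who wants to sanity-check the arithmetic: 90% vs 74% of 50 questions means 45 vs 37 correct answers. The newsletter doesn’t state which test produced p=0.033 - and since both models answered the same 50 questions, a paired test like McNemar’s would also be a natural choice, but that needs the per-question breakdown. The sketch below just runs simple unpaired comparisons on the 2x2 table; it lands in the same ballpark without necessarily reproducing the exact value.

```python
from scipy.stats import chi2_contingency, fisher_exact

table = [[45, 5],   # o1: correct, incorrect (90% of 50)
         [37, 13]]  # DeepSeek R1: correct, incorrect (74% of 50)

chi2, p_chi2, _, _ = chi2_contingency(table, correction=False)
_, p_fisher = fisher_exact(table)
print(f"Pearson chi-square p = {p_chi2:.3f}, Fisher exact p = {p_fisher:.3f}")
```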
The results might sound abstract, so I’ll give you an idea of where DeepSeek f*cked up in contrast to ChatGPT:
It didn’t identify the correct side effects and contraindications of local anesthesia for certain patient subtypes (e.g., premature babies)
It prescribed a zinc solution instead of a more effective cortisol cream for a severe skin condition
It wanted to partially remove a patient’s liver (!), falsely suspecting a tumor recurrence
It couldn’t classify the correct tumor stage in a Hodgkin lymphoma patient
It couldn’t calculate the correct spectacle lens power in an ophthalmology case
Judge for yourself - I personally think most of these mistakes are minor. Nonetheless, in a trust-driven and regulated field such as healthcare, they weigh heavily.
I don’t want to hide the experiment’s limitations: the sample size is small, I intentionally selected only the cases rated 4-5 out of 5 in difficulty, and the IMPP questions are sometimes absurdly far from clinical reality. The multiple-choice format is also not ideal. In addition, we can’t separate the influence of pretraining from that of the reasoning layer on each model’s score. Nonetheless, the experimental design is a cool proxy for current model performance at the highest spheres of medical knowledge.
So, what can we learn from this experiment? Let’s speculate:
ChatGPT o1 reasons insanely well; I would trust it over any average doctor (at least in theory - more on that in the next newsletter)
DeepSeek is impressive on the surface, but underperforms OpenAI in expert fields - either its training data or its reasoning ability lags behind
As a medical LLM startup, keep being afraid of OpenAI rendering you obsolete, not DeepSeek
In summary, brute-force computing power and data still seem to buy you an advantage, at least in healthcare.
What’s your take as an investor or founder? Have you tested DeepSeek already? If you have gathered any data, let’s have a chat.
Happy Tuesday,
Lucas