
Hotline Ping: Chatbots as Medical Counselors?


In early 2021, the Trevor Project — a mental health crisis hotline for LGBTQIA+ youths — made headlines with its decision to utilize an AI chatbot as a method for training counselors to deal with real crises from real people. They named the chatbot “Riley.” The utility of such a tool is obvious: if successful, new recruits could be trained at all times of day or night, trained en masse, and trained to deal with a diverse array of problems and emergencies. Additionally, training workers on a chatbot greatly minimizes the risk of something going wrong if someone experiencing a severe mental health emergency got connected with a brand-new counselor. If a new trainee makes a mistake in counseling Riley, there is no actual human at risk. Trevor Project counselors can learn by making mistakes with an algorithm rather than a vulnerable teenager.

Unsurprisingly, this technology soon expanded beyond the scope of training counselors. In October of 2021, the project reported that chatbots were also used to screen youths (who contact the hotline via text) to determine their level of risk. Those predicted to be most at-risk, according to the algorithm, are put in a “priority queue” to reach counselors more quickly. Additionally, the Trevor Project is not the only medical/counseling organization utilizing high-tech chatbots with human-like conversational abilities. Australian clinics that specialize in genetic counseling have recently begun using a chatbot named “Edna” to talk with patients and help them make decisions about whether or not to get certain genetic screenings. The U.K.-based Recovery Research Center is currently implementing a chatbot to help doctors stay up-to-date on the conditions of patients who struggle with chronic pain.

On initial reading, the idea of using AI to help people through a mental or physical crisis might make the average person feel uncomfortable. While we may, under dire circumstances, feel okay about divulging our deepest fears and traumas to an empathetic and understanding human, the idea of typing out all of this information to be processed by an algorithm smacks of a chilly technological dystopia where humans are scanned and passed along like mere bins of data. Of course, a more measured take shows the noble intentions behind the use of the chatbots. Chatbots can help train more counselors, provide more people with the assistance they need, and identify those people who need to reach human counselors as quickly as possible.

On the other hand, big data algorithms have become notorious for the biases and false predictive tendencies hidden beneath a layer of false objectivity. Algorithms themselves are no more useful than the data we put into them. Chatbots in Australian mental health crisis hotlines were trained by analyzing “more than 100 suicide notes” to gain information about words and phrases that signal hopelessness or despair. But 100 is a fairly small number. On average, there are more than 130 suicides every day in the United States alone. Further, only 25-30% of people who commit suicide leave a note at all. Those who do leave a note may be having a very different kind of mental health crisis than those who leave no note, meaning that these chatbots would be trained to only recognize clues present in (at best) about a quarter of successful suicides. Further, we might worry that stigma surrounding mental health care in certain communities could disadvantage teens who already have a hard time accessing these resources. The chatbot may not have enough information to recognize a severe mental health crisis in someone who does not know the relevant words to describe their experience, or who is being reserved out of a sense of shame.

Of course, there is no guarantee that a human correspondent would be any better at avoiding bias, short-sightedness, and limited information than an algorithm would be. There is, perhaps, good reason to think that a human would be much worse, on average. Human minds can process far less information, at a far slower pace, than algorithms, and our reasoning is often imperfect and driven by emotions. It is easy to imagine the argument being made that, yes, chatbots aren’t perfect, but they are much more reliable than a human correspondent would be.

Still, it seems doubtful that young people would, in the midst of a mental health crisis, take comfort in the idea of typing their problems to an algorithm rather than communicating them to a human being. The facts are that most consumers strongly prefer talking with humans over chatbots, even when the chatbots are more efficient. There is something cold about the idea of making teens — some in life-or-death situations — make it through a chatbot screening before being connected with someone. Even if the process is extremely short, it can still be jarring. How many of us avoid calling certain numbers just to avoid having to interact with a machine?

Yet, perhaps a sufficiently life-like chatbot would neutralize these concerns, and make those who call or text in to the hotline feel just as comfortable as if they were communicating with a person. Research has long shown that humans are able to form emotional connections with AI extremely quickly, even if the AI is fairly rudimentary. And more people seem to be getting comfortable with the idea of talking about their mental health struggles with a robot. Is this an inevitable result of technology becoming more and more a ubiquitous part of our lives? Is it a consequence of the difficulty of connecting with real humans in our era of solitude and fast-paced living? Or, maybe, are the robots simply becoming more life-like? Whatever the case may be, we should be diligent in ensuring that these chatbots rely on algorithms that help overcome deep human biases, rather than further ingrain them.

The Insufficiency of Black Box AI


Google and Imperial College London have collaborated in a trial of an AI system for diagnosing breast cancer. Their most recent results have shown that the AI system can outperform the uncorroborated diagnosis of a single trained doctor and perform on par with pairs of trained diagnosticians. The AI system was a deep learning model, meaning that it works by discovering patterns on its own by being trained on a huge database. In this case the database was thousands of mammogram images. Similar systems are used in the context of law enforcement and the justice system. In these cases the learning database is past police records. Despite the promise of this kind of system, there is a problem: there is not a readily available explanation of what pattern the systems are relying on to reach their conclusions. That is, the AI doesn’t provide reasons for its conclusions and so the experts relying on these systems can’t either.

AI systems that do not provide reasons in support of their conclusions are known as “black box” AI. In contrast to these are so-called “explainable AI”. This kind of AI system is under development and likely to be rapidly adopted within the healthcare field. Why is this so? Imagine visiting the doctor and receiving a cancer diagnosis. When you ask the doctor, “Why do you think I have cancer?” they reply only with a blank stare or reply, “I just know.” Would you find this satisfying or reassuring? Probably not, because you have been provided neither reason nor explanation. A diagnosis is not just a conclusion about a patient’s health but also the facts that lead up to that conclusion. There are certain reasons that the doctor might give you that you would reject as reasons that can support a cancer diagnosis.

For example, an AI system designed at Stanford University and trained to help diagnose tuberculosis used non-medical evidence to generate its conclusions. Rather than just taking into account the images of patients’ lungs, the system used information about the type of X-ray scanning device when generating diagnoses. But why is this a problem? If the information about what type of X-ray machine was used has a strong correlation with whether a patient has tuberculosis, shouldn’t that information be put to use? That is, don’t doctors and patients want to maximize the number of correct diagnoses they make? Imagine your doctor telling you, “I am diagnosing you with tuberculosis because I scanned you with Machine X, and people who are scanned by Machine X are more likely to have tuberculosis.” You would not likely find this a satisfying reason for a diagnosis. So if an AI is making diagnoses based on such facts, this is a cause for concern.
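
To make the worry concrete, here is a minimal sketch in Python of a "diagnosis" rule that looks only at which scanner was used. It is not the Stanford system. The machine labels, the patient counts, and the function names are all hypothetical, invented purely for illustration. The sketch shows that such a shortcut can score well where the scanner happens to track the disease and then fall apart once that coincidence no longer holds.

```python
# Hypothetical illustration only: the counts and names below are made up.

def make_records(counts):
    """Expand {(machine, has_tb): n} into a list of (machine, has_tb) records."""
    return [key for key, n in counts.items() for _ in range(n)]

def shortcut_predict(machine):
    """Predict tuberculosis from the scanner type alone, ignoring the image."""
    return machine == "X"

def accuracy(records):
    """Fraction of records where the shortcut prediction matches the truth."""
    correct = sum(shortcut_predict(machine) == has_tb for machine, has_tb in records)
    return correct / len(records)

# Setting 1 (assumed): Machine X is mostly used on patients who have tuberculosis.
original = make_records({
    ("X", True): 90, ("X", False): 10,
    ("Y", True): 10, ("Y", False): 90,
})

# Setting 2 (assumed): Machine X is now used mostly on healthy patients.
shifted = make_records({
    ("X", True): 10, ("X", False): 90,
    ("Y", True): 10, ("Y", False): 90,
})

print(accuracy(original))  # 0.9
print(accuracy(shifted))   # 0.5
```

In the first setting the rule is right 90% of the time without ever consulting a lung image. In the second it is no better than a coin flip. A black box system gives no way of telling, from the outside, whether its accuracy rests on this kind of shortcut.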

A similar problem is discussed in philosophy of law when considering whether it is acceptable to convict people on the basis of statistical evidence. The thought experiment used to probe this problem involves a prison yard riot. There are 100 prisoners in the yard, and 99 of them riot by attacking the guard. One of the prisoners did not attack the guard, and was not involved in planning the riot. However, there is no way of knowing of each specific prisoner whether they did, or did not, participate in the riot. All that is known is that 99 of the 100 prisoners participated. The question is whether it is acceptable to convict each prisoner based only on the fact that it is 99% likely that they participated in the riot.

Many who have addressed this problem answer in the negative—it is not appropriate to convict an inmate merely on the basis of statistical evidence. (However, David Papineau has recently argued that it is appropriate to convict on the basis of such strong statistical evidence.) One way to understand why it may be inappropriate to convict on the basis of statistical evidence alone, no matter how strong, is to consider the difference between circumstantial and direct evidence. Direct evidence is any evidence which immediately shows that someone committed a crime. For example, if you see Robert punch Willem in the face you have direct evidence that Robert committed battery (i.e., causing harm through touch that was not consented to). If you had instead walked into the room to see Willem holding his face in pain and Robert angrily rubbing his knuckles, you would only have circumstantial evidence that Robert committed battery. You must infer that battery occurred from what you actually witnessed.

Here’s the same point put another way. Given that you saw Robert punch Willem in the face, there is a 100% chance that Robert battered Willem—hence it is direct evidence. On the other hand, given that you saw Willem holding his face in pain and Robert angrily rubbing his knuckles, there is a 0% – 99% chance that Robert battered Willem. The same applies to any prisoner in the yard during the riot: given that they were in the yard during the riot, there is at best a 99% chance that the prisoner attacked the guard. The fact that a prisoner was in the yard at the time of the riot is a single piece of circumstantial evidence in favor of the conclusion that that prisoner attacked the guard. A single piece of circumstantial evidence is not usually taken to be sufficient to convict someone—further corroborating evidence is required.

The same point could be made about diagnoses. Even if 99% of people examined by Machine X have tuberculosis, simply being examined by Machine X is not a sufficient reason to conclude that someone has tuberculosis. No reasonable doctor would make a diagnosis on such a flimsy basis, and no reasonable court would convict someone on the flimsy basis in the prison yard riot case above. Black box AI algorithms might not be basing diagnoses or decisions about law enforcement on such a flimsy basis. But because this sort of AI system doesn’t provide its reasons, there is no way to tell what makes its accurate conclusions correct, or its inaccurate conclusions incorrect. Any domain like law or medicine where the reasons that underlie a conclusion are crucially important is a domain in which explainable AI is a necessity, and in which black box AI must not be used.

Is it Fair to Blame President Trump’s Behavior on Mental Illness?


On October 25, former Oklahoma Senator Tom Coburn (a Republican) said that President Trump has “a personality disorder.” He was not the first to posit that President Trump has some form of mental illness. The press has been engaging with such speculation since the start of his campaign, though there has been a decided increase of late. On October 26, New York Times columnist David Brooks reported that some Republican senators thought Trump is “suffering from early Alzheimer’s.” In an article titled “Some Republicans are starting to more openly question Trump’s Mental health,” Business Insider reports that “One psychiatric professor at Yale said about half a dozen lawmakers had contacted her over the past several months.”
