San Francisco, CA
empathie labs
support@empathielabs.com
ABOUT
Learn continuously, like humans do. Empathie enables you to frequently add to and improve your large language model's capabilities. Open source models are capable of exceeding frontier foundation models within specialized domains. Your model is portable, fast, and cost effective thanks to its lower compute needs and your ability to deploy one multi-task model that handles a wide diversity of tasks, a problem previously addressed by deploying several specialized models or writing long, complex instruction prompts. We have enhanced open source pre-trained large language models, allowing these models to adapt over time from continuous feedback. Our customers own the models we help them train, and they benefit from privacy, new capabilities, lower costs, and complete control of their AI stack.

Contact us at: support@empathielabs.com

Continual Learning
Large language models are valuable because one model can understand from the prompt which of many tasks the user is asking for; they are multi-task. When we finetune large language models, they improve at one skill at the expense of previously learned skills. This skill degradation is called catastrophic forgetting, and it is the reason we spend so much money on compute retraining models from scratch rather than building on the skills our model has already spent so much time learning during pretraining, post-training, and earlier finetunes.
Background: the model in the plot is a transformer LLM, and the two skills, A and B, are generative tasks requiring several output tokens. 1. With standard fine-tuning (purple), there is a 40% drop in Skill A quality that is still worsening as Skill B improves. 2. Continual learning (blue) allows one model to reach the desired quality on both skills. Without continual learning, Skills A and B have different baseline performances and different rates of learning progress. 3. Skill B is learned after the Skill A data has been deleted; the two datasets have no overlap. During updates 1-20 the large language model trains only on Skill A; during updates 21-60 it trains only on Skill B.
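To make catastrophic forgetting concrete, here is a minimal toy sketch in PyTorch. A tiny MLP and two synthetic classification rules stand in for the LLM and Skills A & B in the plot; none of this is our actual training code.

```python
# Toy illustration of catastrophic forgetting (not our training code).
# A tiny MLP stands in for an LLM; two synthetic classification rules
# stand in for Skill A and Skill B from the plot above.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_task(rule):
    x = torch.randn(2000, 2)
    return x, rule(x).long()

# Skill A: is the point above the line y = x?  Skill B: inside the unit circle?
xa, ya = make_task(lambda x: x[:, 1] > x[:, 0])
xb, yb = make_task(lambda x: (x ** 2).sum(dim=1) < 1.0)

model = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train(x, y, steps=200):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

train(xa, ya)   # analogue of updates 1-20: Skill A data only
print(f"after Skill A phase: A acc = {accuracy(xa, ya):.2f}")

train(xb, yb)   # analogue of updates 21-60: Skill B data only, A data deleted
print(f"after Skill B phase: A acc = {accuracy(xa, ya):.2f}, "
      f"B acc = {accuracy(xb, yb):.2f}")
# Skill A accuracy typically collapses in the second phase -- the standard
# fine-tuning (purple) curve. Continual learning methods aim to avoid this.
```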
Medical Record Categorization & Translation
Customer A is a Data as a Service company leveraging access to exclusive electronic health record (EHR) data to deliver up-to-date, real-world clinical insights that drive clients' business decisions on development and commercialization strategy. To provide actionable, evidence-backed answers to their clients' questions, Customer A partnered with us to train a model that could translate noisy, raw, unstructured data from a variety of EHR formats and output the relevant clinical findings and treatment plan in a structured JSON format of standardized categories, with a rationale for each clinical decision. Due to the sensitive nature of Customer A's data asset, all data needed to stay internal, prohibiting the use of frontier models like ChatGPT.
In our first collaboration we leveraged the pretrained skills of large language models to finetune a 2B-parameter model with 8K context, using fewer than 500 examples, to ignore irrelevant text, handle redundancy, and categorize only the relevant text (highlighted color-coded lines; sensitive facts have been modified for privacy) from the raw data. We measured our efficacy using medical statistics including precision, recall, and F1 score. In our second collaboration with Customer A, we maximized the training utility of newly clinician-labeled medical records and further finetuned the same model from our first collaboration to translate the categorized text into the clinician's rationale for each treatment decision in layman's terms. Although Customer A no longer had access to the data from the first collaboration, we were still able to train the model in a way that built on the categorization task rather than replacing it, resulting in a multi-task model with excellent precision, recall, and F1 score on both tasks, categorization and translation. The resulting model is much cheaper to deploy than frontier models like ChatGPT, and it is owned exclusively by Customer A for secure, private, internal use on sensitive data.
Training Compute Costs: $7.50/hr x 2 weeks ≈ $2,500
Data Processing Costs: 80M tokens x $10/1M tokens = $800
Engineering Costs: $700
Total: $4,000
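As an illustration of how the single multi-task model can serve both collaborations, here is a hedged sketch; the task tags, JSON field names, and model identifier are hypothetical placeholders, not Customer A's actual private format.

```python
# Illustrative sketch of one multi-task model serving both collaborations.
# The task tags, JSON field names, and model identifier are hypothetical
# placeholders, not Customer A's actual private format.
import json
from transformers import pipeline

# Assumes the finetuned ~2B model is hosted locally for private internal use.
generate = pipeline("text-generation", model="customer-a/ehr-multitask-2b")

raw_record = "..."  # noisy unstructured EHR text, within the 8K-token context

# Task 1 (first collaboration): categorize only the relevant text.
categorized = generate(f"<categorize>\n{raw_record}",
                       return_full_text=False)[0]["generated_text"]

# Task 2 (second collaboration): translate the categorized text into the
# clinician's rationale in layman's terms, emitted as structured JSON.
translated = generate(f"<translate>\n{categorized}",
                      return_full_text=False)[0]["generated_text"]

findings = json.loads(translated)
# e.g. [{"category": "treatment_plan", "finding": "...",
#        "rationale_layman": "..."}]
```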
Multi-task Chatbot
Customer B connects their customers to chatbots via phone, augmented reality, tablet, or browser to serve as remote workers, NPCs, assistants, companions, or anything clients need their chatbots to be. Their pain points are:
  • The unwillingness of models from large tech companies to even touch certain topics, or to speak in any voice other than that of a formal AI assistant
  • The engineering complexity of managing several models to perform various tasks in production such as conversation generation, safety detection, response quality monitoring, etc.
  • The latency of responses from large model APIs and self-hosted models even with inference time optimizations
  • The long and ever-growing input prompts needed to keep the entire chat history in memory, helping the chatbot stay self-consistent over multiple turns
We addressed each pain point in turn:
  • To address censoring, we fine-tuned the model to speak in a more personal manner, to ask questions, and to go deeper into otherwise disallowed topics for the purpose of clarification, while still neither giving medical treatment advice nor making unethical statements.
  • To address the engineering complexity of managing several models, we finetuned one model to switch tasks, controlled by custom instructions and symbols that indicate which mode the LLM is working in (see the sketch after this list). On the left is a comprehensive dialogue prompt used to instruct the model to act as a chatbot, obeying the behavior described in the system prompt and staying consistent with the chat history leading up to the generation prompt. On the right is a classification prompt used to classify the conversational transcript thus far into one of several categories.
  • To address latency, we shortened the input prompt and reduced the model size, then finetuned the model on an augmented version of Customer B's internal data to generate the desired output from these more concise instruction prompts.
  • To address the increasing cost of growing chat histories, we added yet another mode to the same multi-task generation and preference model. The new mode summarizes conversations into a format that can be added to the system prompt in place of the full original conversation transcript. We then finetuned the model to make optimal use of the summary portion of the prompt, generating chat responses that stay self-consistent with the beginning, middle, and recent parts of the conversation.
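Here is the sketch referenced above of the mode-switching pattern. The mode symbols and prompt templates are invented for illustration; Customer B's actual instructions and symbols are private.

```python
# Sketch of one model switching tasks via mode symbols in the prompt.
# The symbols and templates below are invented for illustration;
# Customer B's actual instructions and symbols are private.

MODES = {
    "chat": "<chat>",            # dialogue generation obeying the system prompt
    "classify": "<classify>",    # label the transcript into one of several categories
    "summarize": "<summarize>",  # compress older turns to keep prompts short
}

def build_prompt(mode: str, system_prompt: str, summary: str, recent_turns: list[str]) -> str:
    """Assemble a concise prompt: older history lives in the summary,
    only recent turns stay verbatim, keeping the prompt short and latency low."""
    return (
        f"{MODES[mode]}\n"
        f"System: {system_prompt}\n"
        f"Summary of earlier conversation: {summary}\n"
        + "\n".join(recent_turns)
        + "\nAssistant:"
    )

# Example: the same model, in two different modes, on the same conversation.
chat_prompt = build_prompt(
    "chat",
    system_prompt="You are a friendly companion named Ada.",
    summary="User introduced themselves as Sam and discussed weekend plans.",
    recent_turns=["User: So what should I cook tonight?"],
)
classify_prompt = build_prompt(
    "classify",
    system_prompt="Classify the conversation so far into one category.",
    summary="User introduced themselves as Sam and discussed weekend plans.",
    recent_turns=["User: So what should I cook tonight?"],
)
```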
The resulting model is being used to analyze both the user's messages and the model's own generations before responding. The stewardship mode monitors the model's responses for being helpful and compelling. The consistency classifier grades the model on its coherence with the entire conversation thus far, as opposed to hallucinating or drifting off topic. The guardianship classifier checks whether the user is in danger and whether the model's response adheres to a custom set of rules defined by Customer B.
Training Compute Costs: $18/hr x 2 weeks ≈ $6,000
Data Processing Costs: 500M tokens x $10/1M tokens = $5,000
Engineering Costs: $4,000
Total: $15,000
Updating Knowledge & Skills
A known limitation of LLMs stems from their pre-training on a corpus of data fixed in time. Without augmentation, an LLM has no knowledge of events that occurred after its pre-training, or of information that was never scraped for its pre-training, such as private data. Popular methods to address hallucination and out-of-date knowledge, namely retrieval augmented generation (RAG) and function calling, compensate by inserting some hopefully useful text into the prompt. In RAG, the user question is embedded and a vector similarity search is run across a set of documents to retrieve text that can be inserted into the prompt. Function calling uses the model's output to query an API, for example for the latest news, and inserts the result into the prompt. Customer C uses empathie to update their finance LLM weekly with the most recent financial reports; they are particularly sensitive to hallucinations and prefer the model to simply reply "I don't know" when the relevant information is neither contained in the retrieved text nor in the weights.
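A minimal sketch of the RAG retrieval step just described, using an off-the-shelf embedding model; the corpus, prompt wording, and model choice are illustrative, not Customer C's actual pipeline.

```python
# Minimal sketch of the RAG retrieval step: embed the question, rank documents
# by cosine similarity, and build a prompt that allows "I don't know".
# The corpus, prompt wording, and embedding model are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Q3 earnings report: revenue grew 12% year over year ...",
    "Weekly market commentary: rates held steady ...",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

question = "How did revenue change in Q3?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, the dot product is the cosine similarity.
best = documents[int(np.argmax(doc_vecs @ q_vec))]

prompt = (
    f"Context:\n{best}\n\n"
    f"Question: {question}\n"
    "If neither the context nor your own knowledge contains the answer, "
    "reply exactly: I don't know."
)
```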
It is not clear how best to inject new unstructured knowledge into prompts, or how to encourage LLMs to gracefully compensate for failed document retrieval. For example, the retrieved document might contain only the title of an article while the body of the article sits in the next document.
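One common mitigation for this split-across-documents failure is overlapping chunking, sketched below; the chunk and overlap sizes are arbitrary illustrative choices.

```python
# Overlapping chunks: adjacent chunks share `overlap` characters, so a title
# and the body that follows it are likely to co-occur in at least one chunk.
# Chunk and overlap sizes are arbitrary illustrative choices.
def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

pieces = chunk("An Article Title\n" + "body text ... " * 200)
```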
Empathie Labs' technology allows frequent, repeated, additive finetuning updates to the same LLM. These updates both teach LLMs how to appropriately use injected information and update the weights with recent events, allowing the model to transition smoothly between knowledge encoded in its parameters and knowledge injected into its prompt.
Training Compute Costs: $4/hr x 1 day ≈ $100
Data Processing Costs: 2M tokens x $10/1M tokens = $20
Engineering Costs: $80
Total: $200 per update
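To give a feel for what one weekly update can train on, here is a simplified sketch of the kind of example mix involved; it is illustrative only, not our production recipe, and the helper below and its inputs are hypothetical.

```python
# Simplified sketch of the kind of example mix a weekly update can train on;
# illustrative only, not our production pipeline. The helper below and its
# inputs are hypothetical.
def build_update_examples(report_text, qa_pairs):
    examples = []
    for question, answer in qa_pairs:
        # (a) Open-book: retrieved text in the prompt, so the model learns
        #     to appropriately use injected information.
        examples.append({
            "prompt": f"Context:\n{report_text}\n\nQuestion: {question}",
            "target": answer,
        })
        # (b) Closed-book: no context, so the new facts must be encoded
        #     in the weights themselves.
        examples.append({
            "prompt": f"Question: {question}",
            "target": answer,
        })
    # (c) Failed retrieval: irrelevant context and a question the weights
    #     cannot answer; the desired behavior is a graceful refusal.
    examples.append({
        "prompt": "Context:\n(unrelated filing)\n\nQuestion: "
                  "What does the report say about 2031 guidance?",
        "target": "I don't know",
    })
    return examples

examples = build_update_examples(
    "Q3 revenue grew 12% year over year.",
    [("How did revenue change in Q3?", "Revenue grew 12% year over year.")],
)
```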