Introduction
Today, we’re excited to announce the open-source release of TextChunker! This library empowers Elixir developers to break down large text documents into meaningful semantic chunks for various AI-powered tasks, like Retrieval Augmented Generation (RAG).
Why did Revelry build TextChunker?
We’ve been building an Elixir app that uses RAG to enhance an AI integration inside a new product we’re developing. We found ourselves in need of a robust text chunking solution, specifically one optimized for RAG. Strategies and libraries for intelligent text chunking exist for Python and JavaScript in projects like LangChain, but we found nothing comparable for Elixir, so we created TextChunker to fill that gap in the ecosystem. We’re excited to share it with the broader Elixir community!
If you’re interested in our AI Elixir project, built to leverage AI throughout your software delivery cycle, visit us at https://prodops.ai/.
What is TextChunker?
TextChunker is an Elixir library specifically designed to optimize text segmentation for use with vector databases and RAG applications.
Why Use TextChunker for Retrieval Augmented Generation (RAG)?
Because with RAG, simply splitting text won’t cut it. LLMs need contextually relevant and useful chunks of text to produce the best outputs. To get that, you need split boundaries chosen intelligently for the format of the text in question (plaintext, Markdown, code), along with flexible control over chunk overlap and chunk size.
Key Features
- Semantic chunking: TextChunker goes beyond basic chunking by intelligently segmenting text based on separators that are meaningful to the specified format (e.g., headings and paragraphs in Markdown). Semantic separators supported out of the box include:
  - Plaintext
  - Markdown
  - Elixir code
  - Ruby code
  - PHP
  - Vue
  - JavaScript
  Note: if you don’t see your favorite format here, don’t worry. We’ve built TextChunker to be flexible when it comes to supporting new formats.
- Configurable chunking: Fine-tune the process with options to control chunk size, overlap, and data format. It should be fairly trivial to integrate new strategies of chunking in the future, too.
- Metadata tracking: Maintain the integrity of your original text with automatic generation of Chunk structs containing byte range information.
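To make the ideas in the list above concrete, here is a minimal, self-contained sketch of separator-based chunking with byte-range tracking. It is not TextChunker’s implementation: the ChunkSketch module, its Chunk struct, and the :separator option are hypothetical names made up for illustration, and the sketch leaves out overlap and the fallback to finer separators that a real chunker would need.

```elixir
defmodule ChunkSketch do
  # Hypothetical struct for illustration; not TextChunker's own Chunk struct.
  defmodule Chunk do
    defstruct [:start_byte, :end_byte, :text]
  end

  # Splits `text` on a separator, then greedily packs neighbouring pieces
  # into chunks of at most `chunk_size` bytes, recording each byte range.
  def split(text, opts \\ []) do
    chunk_size = Keyword.get(opts, :chunk_size, 100)
    separator = Keyword.get(opts, :separator, "\n\n")

    text
    |> piece_ranges(separator)
    |> pack(chunk_size)
    |> Enum.map(fn {start, stop} ->
      %Chunk{
        start_byte: start,
        end_byte: stop,
        text: binary_part(text, start, stop - start)
      }
    end)
  end

  # Returns contiguous {start, stop} byte ranges, one per piece.
  # Each separator is kept at the end of the piece that precedes it.
  defp piece_ranges(text, separator) do
    cuts = for {pos, len} <- :binary.matches(text, separator), do: pos + len

    ([0 | cuts] ++ [byte_size(text)])
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.reject(fn [a, b] -> a == b end)
    |> Enum.map(fn [a, b] -> {a, b} end)
  end

  # Extends the current chunk while the next piece still fits; otherwise
  # starts a new chunk. A single piece larger than `chunk_size` is kept
  # as one oversized chunk in this sketch.
  defp pack(ranges, chunk_size) do
    ranges
    |> Enum.reduce([], fn
      {_start, stop}, [{cur_start, _cur_stop} | rest]
      when stop - cur_start <= chunk_size ->
        [{cur_start, stop} | rest]

      range, acc ->
        [range | acc]
    end)
    |> Enum.reverse()
  end
end

text = "Intro paragraph.\n\nSecond paragraph.\n\nThird."
chunks = ChunkSketch.split(text, chunk_size: 40)
Enum.each(chunks, &IO.inspect/1)
```

Because every chunk carries its start and end byte, any chunk can be traced back to its exact position in the source document, which is what the byte range metadata is for.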
Get Started with TextChunker
Install TextChunker from Hex:
def deps do
  [
    {:text_chunker, "~> 0.1"}
  ]
end
Find detailed documentation on our GitHub repository: https://github.com/revelrylabs/text_chunker_ex or Hex: https://hex.pm/packages/text_chunker.
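Once the dependency is installed, a first call might look like the sketch below. It assumes the library exposes TextChunker.split/2 with :chunk_size, :chunk_overlap, and :format options, and that each returned chunk has start_byte, end_byte, and text fields. Treat those names as assumptions and confirm them against the documentation for the version you install.

```elixir
# Assumed API: TextChunker.split/2 and the option names below.
# Verify both against the library documentation before relying on them.
text =
  "## Overview\n\nTextChunker splits long documents.\n\n" <>
    "## Details\n\nEach chunk keeps its byte range."

opts = [chunk_size: 60, chunk_overlap: 10, format: :markdown]

text
|> TextChunker.split(opts)
|> Enum.each(fn chunk ->
  # Assumed fields on the returned Chunk struct.
  IO.inspect({chunk.start_byte, chunk.end_byte, chunk.text})
end)
```

The sizes here are deliberately tiny so that the split is easy to follow by eye. For real RAG workloads you would pick a chunk size and overlap that suit your embedding model and vector database.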
Join the Community
We believe open-source projects thrive with collaboration. Contribute to TextChunker’s development, share use cases, or report issues on our GitHub repository.
Let’s Talk AI
As software engineers, we are duty-bound to keep learning, and there’s nothing more exciting right now than learning about artificial intelligence. Here are some other articles exploring its mysteries that our engineering team members have published:
- Memory Consumption and Limitations in LLMs with Large Context Windows
- Memory Consumption and Limitations in LLMs with Large Context Windows, Pt II
- Comparing OpenAI’s Assistants API, Custom GPTs, and Chat Completion API
- Creating an “Agent” Using OpenAI’s Functions API
- Our Journey: Building with Generative AI, Part I
