Langchain Summarization App
Introduction
Building on the discoveries and insights from the Summarization Suite, I developed a more robust system that combines summarization capabilities across text, audio and video formats in a single application, this time deployed as an API service.
In this project the summarization operations performed over the different media formats were combined using the Langchain framework along with many of its integrations with other services and libraries. The final solution consists of a RESTful API served via FastAPI with batching/streaming capabilities, interacting with a microservice-based system deployed in Podman containers. Additionally, all LLM operations, including the summarization tasks, run locally within GPU-capable containers.
Apart from the summarization operations, a simple feedback mechanism was implemented to collect, via the REST API, feedback on the generated summaries.
Highlights
- Podman as the container engine for microservice-based deployment
- Streaming capabilities directly via Langchain
- LLM deployment with Ollama in GPU-capable containers (powered by the NVIDIA Container Toolkit)
- Redis for LLM response caching
- MongoDB as the storage solution, storing summaries, metadata and feedback
- FastAPI as web framework
Input Formats
The two main formats processed by the application are text and audio. Langchain makes it easy to extract the content of a myriad of file formats using its Document Loaders, which act as the initial step in the processing pipeline. The Summarization Service class dynamically selects the appropriate Loader based on the file type, identified using libmagic. Consequently, all files submitted to the service are uploaded as raw binary data.
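As an illustration, here is a minimal sketch of what such dynamic selection might look like; the helper name and the loader mapping are hypothetical, not the project's actual code:

import magic  # python-magic, wraps libmagic
from langchain_community.document_loaders import PyPDFLoader, TextLoader

# Hypothetical mapping from detected MIME type to a Langchain Document Loader.
LOADERS = {
    "application/pdf": PyPDFLoader,
    "text/plain": TextLoader,
}

def select_loader(file_path: str):
    """Pick a Document Loader based on the MIME type detected by libmagic."""
    mime_type = magic.from_file(file_path, mime=True)
    loader_cls = LOADERS.get(mime_type)
    if loader_cls is None:
        raise ValueError(f"Unsupported file type: {mime_type}")
    return loader_cls(file_path)

# Usage: documents = select_loader("/tmp/file.pdf").load()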
Below is an example of an API call in which a file file.pdf is uploaded to the appropriate endpoint via a POST request.
$ curl -X POST "http://0.0.0.0:8000/summarize" \
    -F "file=@file.pdf;type=application/octet-stream" \
    --no-buffer
Audio content is handled through a custom Document Loader (integrated with the langchain-community package). This PR provides further details and rationale for the integration. For now, it suffices to know that audio transcription is handled locally but operates independently of the summarization environment.

Architecture
The diagram below depicts the overall architecture of the application: gray boxes indicate the containers, black boxes indicate the actual services run in each container, and dotted lines indicate optional communication between services.
flowchart LR
    subgraph Model["LLM (GPU)"]
        direction LR
        O["Ollama"]
    end
    subgraph Cache["Cache Service"]
        R[("Redis")]
    end
    subgraph Service["Summarization Service"]
        FastAPI
        H["Langchain Service"]
        Loader
    end
    subgraph Tr["Transcription (GPU)"]
        T["Faster Whisper"]
    end
    subgraph D["Storage Service"]
        direction LR
        mongo[(MongoDB)]
    end
    User --> FastAPI --> H
    H ---> mongo
    H ---> O <-.-> R
    H <--> Loader <-.-> T
Containers & Components
The central container, “Summarization Service”, orchestrates the processing of documents forwarded by the FastAPI server. It extracts the content with the appropriate Document Loader, sends the extracted text to the LLM service and stores the received summary (and its metadata) in the Storage Service it is connected to. All communication between the services happens over the Podman-defined network to which all services are connected.
The LLM service runs an instance of Ollama in a GPU-capable container using the NVIDIA Container Toolkit. This service allows us to easily change the LLM used in the application while optimizing hardware utilization as well as other aspects like the number of concurrent inference processes, queue size limits and the simultaneous loading of multiple models.
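For reference, a minimal sketch of how the summarization service might point Langchain at the containerized Ollama instance; the model name, host name and option values are illustrative assumptions, not the project's configuration:

from langchain_ollama import ChatOllama

# "ollama" is assumed to be the container's host name on the Podman network;
# the model and options below are placeholders.
llm = ChatOllama(
    model="llama3.1:8b",             # any model pulled into the Ollama container
    base_url="http://ollama:11434",  # Ollama's default port inside the network
    temperature=0.2,
    num_ctx=8192,                    # context window made available to the model
)

print(llm.invoke("Summarize: Podman is a daemonless container engine.").content)

Concurrency aspects such as parallel inference and queue size are configured on the Ollama side (for example via environment variables like OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE passed to the container) rather than in Langchain.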
The Cache Service uses an instance of Redis integrated with Langchain’s LLM abstraction via the Redis Langchain integration. This integration eliminates the need for custom caching logic while preventing redundant calls to the LLM, which is particularly beneficial when using external (paid) LLM services.
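Wiring this up in Langchain is roughly a one-liner; a sketch assuming the Redis container is reachable under the host name redis (host name and TTL are assumptions):

import redis
from langchain.globals import set_llm_cache
from langchain_community.cache import RedisCache

# Point the global LLM cache at the Redis container on the Podman network.
client = redis.Redis(host="redis", port=6379)
set_llm_cache(RedisCache(client, ttl=3600))  # cache LLM responses for one hour

# From here on, identical prompts sent to any Langchain LLM are served from Redis.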
The Transcription Service operates in another GPU-capable container (using a different version of CUDA) running a faster-whisper-server instance, a containerized runtime for faster-whisper which is integrated with Langchain using the custom Document Loader mentioned earlier. The decoupling of this service prevents us from adding GPU-heavy dependencies to the “Summarization Service” container and also simplifies swapping the transcription service, as it only requires updating the Langchain integration for handling video and audio content.
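As an illustration of what such a loader might look like, here is a hedged sketch; the class, host name and endpoint are assumptions (the endpoint follows the OpenAI-compatible transcription API that faster-whisper-server exposes), not the project's actual loader:

from typing import Iterator
import requests
from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

class WhisperTranscriptLoader(BaseLoader):
    """Hypothetical loader: sends an audio file to a faster-whisper-server
    instance and wraps the transcript in a Langchain Document."""

    def __init__(self, file_path: str, base_url: str = "http://transcription:8000"):
        self.file_path = file_path
        self.base_url = base_url

    def lazy_load(self) -> Iterator[Document]:
        with open(self.file_path, "rb") as audio:
            # OpenAI-compatible transcription endpoint served by faster-whisper-server
            response = requests.post(
                f"{self.base_url}/v1/audio/transcriptions",
                files={"file": audio},
            )
        response.raise_for_status()
        yield Document(
            page_content=response.json()["text"],
            metadata={"source": self.file_path},
        )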
The Storage Service runs a MongoDB instance responsible for storing:
- The generated summaries
- The original document as raw binary (i.e. as received by the application)
- LLM generation metadata and other service parameters
- Feedback received for each summary
Some of Langchain's storage integrations could have been used, but given the specialized nature of the storage schema, a dedicated interface was developed for the application to abstract the interactions with the database (which also makes it simple to swap the storage service if needed).
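A minimal sketch of what such an interface might look like with pymongo; the class, collection names and fields are hypothetical, not the project's schema:

from datetime import datetime, timezone
from bson import Binary, ObjectId
from pymongo import MongoClient

class SummaryStore:
    """Hypothetical thin wrapper around MongoDB for summaries and feedback."""

    def __init__(self, uri: str = "mongodb://storage:27017", db_name: str = "summaries"):
        self._db = MongoClient(uri)[db_name]

    def save_summary(self, raw_file: bytes, summary: str, metadata: dict) -> str:
        doc = {
            "file": Binary(raw_file),   # original document as raw binary
            "summary": summary,
            "metadata": metadata,       # LLM generation metadata and service parameters
            "feedback": [],
            "created_at": datetime.now(timezone.utc),
        }
        return str(self._db.summaries.insert_one(doc).inserted_id)

    def add_feedback(self, summary_id: str, feedback: dict) -> None:
        # Feedback received via the REST API is appended to the summary document.
        self._db.summaries.update_one(
            {"_id": ObjectId(summary_id)}, {"$push": {"feedback": feedback}}
        )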
Execution Flow
The following steps describe a high-level overview of the operations performed on each document submitted to the summarization application; a simplified code sketch follows the step list below. Note that, given the asynchronous nature of the system, multiple steps may execute concurrently on different documents.
sequenceDiagram
    autonumber
    FastAPI ->> Service: File bytes (POST Request)
    Service ->> Loader: File path
    Loader ->> Runnable: Langchain Documents
    Runnable ->> Runnable: Cache Access
    Runnable ->> Service: Summary
    Service -->> FastAPI: Stream/Batch Response
    Service ->> Storage: File Bytes + Summary and Metadata
- The FastAPI server receives the file as raw binary data via a POST request.
- The document is temporarily stored locally, and its file path is passed to the Loader for content extraction.
- The Loader processes the file, extracting its content and converting it into LangChain Documents.
- The Langchain Runnable component checks the Cache Service for any existing summary of the document to avoid redundant processing.
- If no cached result is found, the document is forwarded to the LLM service for summarization. The summary is either streamed directly or accumulated for batch responses.
- The generated summary is then returned to the user through FastAPI, either as a stream or a complete batch, depending on the initial user request.
- Finally, the original file bytes, the summary and associated metadata are saved in the Storage Service for future reference and/or debugging.
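The sketch below ties these steps together in simplified form using a plain LCEL chain; the prompt, model name, host name and loader choice are assumptions rather than the project's actual pipeline:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama
from langchain_community.document_loaders import PyPDFLoader

prompt = ChatPromptTemplate.from_template("Write a concise summary of:\n\n{text}")
llm = ChatOllama(model="llama3.1:8b", base_url="http://ollama:11434")
chain = prompt | llm | StrOutputParser()  # the configured LLM cache is consulted on invoke

async def summarize(file_path: str) -> str:
    docs = PyPDFLoader(file_path).load()  # steps 2-3: Loader extracts Langchain Documents
    text = "\n\n".join(doc.page_content for doc in docs)
    return await chain.ainvoke({"text": text})  # steps 4-6: cached or freshly generated summary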
Streaming & Batching
By design the application utilizes Langchain’s streaming capabilities to generate summaries in real time, sending each chunk back to the user as it is created via FastAPI’s StreamingResponse. Streaming is particularly useful for on-demand summaries as it naturally breaks the summary into smaller chunks; as a result this reduces network bandwidth utilization while also enhancing user experience. The incremental delivery provides a sense of task progress and increases the user's tolerance for latency, making the summarization process more responsive and predictable.
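A rough sketch of how the streaming path might be exposed through FastAPI; the endpoint shape, chain and model details are assumptions, not the service's actual code:

from fastapi import FastAPI, UploadFile
from fastapi.responses import StreamingResponse
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama

app = FastAPI()
chain = (
    ChatPromptTemplate.from_template("Write a concise summary of:\n\n{text}")
    | ChatOllama(model="llama3.1:8b", base_url="http://ollama:11434")
    | StrOutputParser()
)

@app.post("/summarize")
async def summarize_endpoint(file: UploadFile):
    # In the real service the Document Loader extracts the content first;
    # decoding the raw bytes stands in for that step here.
    text = (await file.read()).decode(errors="ignore")

    async def token_stream():
        # astream() yields the summary incrementally as the LLM produces it.
        async for chunk in chain.astream({"text": text}):
            yield chunk

    return StreamingResponse(token_stream(), media_type="text/plain")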
For use cases where real-time streaming is unnecessary — such as batch-processing pipelines that queue multiple documents for summarization — the application can switch to a batch-oriented execution mode. In this mode, the entire summary is generated by the LLM and sent as a single response once processing is complete. This approach is ideal for scenarios prioritizing simplicity or where network bandwidth and latency are not significant concerns.
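Under the same assumptions (reusing app and chain from the previous sketch), the batch path simply awaits the full result instead of streaming it; the endpoint name is hypothetical and the actual service may expose mode selection differently:

from fastapi import UploadFile
from fastapi.responses import JSONResponse

@app.post("/summarize-batch")
async def summarize_batch(file: UploadFile):
    text = (await file.read()).decode(errors="ignore")
    # ainvoke() waits for the complete summary before responding.
    summary = await chain.ainvoke({"text": text})
    return JSONResponse({"summary": summary})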
Both streaming and batching modes allow the application to adapt flexibly to different workflows, whether it is being used interactively with the front-end or as a component in a data processing pipeline. Regardless of the execution mode triggered, the application collects and stores all relevant metadata in the Storage Service to ensure proper logging and record-keeping during its operation.