Thank you very much for the reply. I ended up finding a method that works “ok” for me:
- I used the NLTK sentence splitting from LangChain. I experimented with the character and sentence splitters and found the sentence splitter to make the most sense to me. My thinking is that by breaking chunks on sentence boundaries I don’t need to worry about overlapping characters, and all the content still gets “gobbled up” by the model. So I have this:
import nltk
nltk.download('punkt')
from langchain.text_splitter import NLTKTextSplitter

def split_on_sentences(text, split_size):
    # Split on sentence boundaries, packing whole sentences into chunks
    # of at most split_size characters, with no overlap between chunks
    nltk_splitter = NLTKTextSplitter(separator=' ', chunk_size=split_size, chunk_overlap=0)
    splits = nltk_splitter.split_text(text)
    split_sizes = [len(split) for split in splits]
    average_split_size = round(sum(split_sizes) / len(split_sizes))
    return {
        'average_split_size': average_split_size,
        'num_splits': len(splits),
        'splits': splits,
    }
I run the lengthy text through it:
split_dict = split_on_sentences(text, 4096)
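The returned dict makes it easy to sanity-check the chunking before spending time on any summarization. Something like this (the numbers in the comments are hypothetical; yours will depend on the text):

print(split_dict['num_splits'])          # e.g. 12 chunks
print(split_dict['average_split_size'])  # e.g. ~3900 characters per chunk
print(split_dict['splits'][0][:200])     # peek at the start of the first chunk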
Then I move on to the “final” summary using a Hugging Face transformer:
from transformers import pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
I also tried the ‘facebook/bart-large-cnn’ model but found the results from the other model more to my liking. That small exploration is how I learned about ROUGE-1 and the question of whether a high score is the most important aspect: I find ROUGE-1 interesting, but the model with the highest score will not necessarily work best for my needs. I am new to this stuff!
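For anyone else new to this, here is a minimal sketch of computing ROUGE-1 with the rouge_score package (this is not part of my pipeline above, and the reference/candidate strings are just placeholders):

from rouge_score import rouge_scorer

# ROUGE-1 measures unigram (single-word) overlap between a candidate summary
# and a human-written reference summary
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
reference = "The quick brown fox jumps over the lazy dog."  # placeholder reference
candidate = "A quick brown fox jumped over a lazy dog."     # placeholder model output
print(scorer.score(reference, candidate)['rouge1'])         # precision, recall, F1

Anyway, back to my pipeline: I summarize each chunk in a loop: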
from tqdm import tqdm

final_summary = ''
for chunk in tqdm(split_dict['splits'], desc='Summarizing'):
    # Summarize each chunk independently, then stitch the pieces together
    summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False)
    final_summary += summary[0]['summary_text'] + " "
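One caveat I should mention (my understanding, not something I verified exhaustively): distilbart’s encoder takes roughly 1024 tokens, so a 4096-character chunk usually fits, but anything longer will be cut off. Passing truncation explicitly makes that clipping deliberate rather than an error:

# truncation=True clips over-long chunks at the model's max input length
# instead of raising an error (assumes a reasonably recent transformers version)
summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False, truncation=True)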
I write this final summary into a file, and then, since I already pay $20/mo for a ChatGPT-4 subscription, I paste that text in manually and get back an expanded, roughly 350-word summary. That last manual step is a bit tedious, but I might as well use what I am paying for.
I like using the targeted transformer models for summarization because they are fairly fast, and the chunk-by-chunk loop is probably doing something similar to what refine does, but it just seems simpler to me to deal with the text this way.
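For comparison, here is my rough approximation of a refine-style pass (my own sketch, not LangChain’s actual implementation): instead of summarizing chunks independently, the running summary is carried forward and re-summarized together with each new chunk.

def refine_style_summary(chunks, summarizer):
    # Rough approximation of refine: fold the running summary into the
    # next summarization call so later chunks can build on earlier ones
    running = ''
    for chunk in chunks:
        combined = (running + " " + chunk).strip()
        result = summarizer(combined, max_length=150, min_length=30,
                            do_sample=False, truncation=True)
        running = result[0]['summary_text']
    return running

# e.g. print(refine_style_summary(split_dict['splits'], summarizer))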
Again, Thank you VERY MUCH for your reply.