1024 token limit for text summarization?

I started playing around with the pipeline("summarization") based on the first example. From what I can tell it has a 1024 limit… here is what I got when I tried more: (13147 > 1024).

Is it true text summarization is good up to 1024 tokens?

That means if we have a longer document (in this case 13147 tokens), wouldn’t we need to text split and then loop through to summarize each split?

If that is true, and the summary of the document is just a concatenation of the summarized splits, then the summary is suboptimal.

What do you recommend to get as good a summary from 13,000 tokens as we get from <= 1024 tokens? Thank you.

The number of tokens accepted depends on the model you are using. There are recent LLMs that support more tokens; you should try them.

On the other hand, yes, you are right: for summarizing longer documents there are a few strategies, such as:

  1. Stuffing
  2. Map Reduce
  3. Refine

Go through the GitHub link and you will understand them.
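As a rough illustration of the Map Reduce strategy named above, here is a minimal sketch. It assumes a hypothetical `summarize` callable (any function that takes a string and returns a shorter string, for example a thin wrapper around a summarization pipeline) and a hypothetical `max_chars` budget; neither name comes from a library.

```python
from typing import Callable, List


def group_by_size(texts: List[str], max_chars: int) -> List[str]:
    """Greedily join consecutive texts into groups of at most max_chars characters."""
    groups, current = [], ""
    for text in texts:
        candidate = f"{current} {text}".strip()
        if current and len(candidate) > max_chars:
            groups.append(current)
            current = text
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],  # hypothetical: text in, shorter text out
    max_chars: int,
    max_rounds: int = 5,
) -> str:
    # Map: summarize every chunk independently.
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: merge neighbouring summaries and summarize again
    # until a single summary remains (or max_rounds is reached).
    for _ in range(max_rounds):
        if len(summaries) <= 1:
            break
        groups = group_by_size(summaries, max_chars)
        summaries = [summarize(group) for group in groups]
    return " ".join(summaries)
```

Stuffing is the degenerate case where the whole document fits in one call, so no splitting is needed at all.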

Thank you very much for the reply. I ended up finding this method to work “ok”:

  1. I used the NLTK sentence splitter from LangChain. I experimented with the character and sentence splitters and found that the sentence splitters made the most sense to me. My thinking is that by breaking the text into chunks on sentence boundaries I don’t need to worry about overlapping characters, and all content is “gobbled up” by the NLP/AI, so I have this:
import nltk  # the splitter relies on NLTK's sentence tokenizer data, e.g. nltk.download('punkt')

from langchain.text_splitter import NLTKTextSplitter

def split_on_sentences(text, split_size):
    # chunk_size is measured in characters here, not in model tokens
    nltk_splitter = NLTKTextSplitter(separator=' ', chunk_size=split_size, chunk_overlap=0)
    splits = nltk_splitter.split_text(text)
    split_sizes = [len(split) for split in splits]
    average_split_size = round(sum(split_sizes) / len(split_sizes))
    return {
        'average_split_size': average_split_size,
        'num_splits': len(splits),
        'splits': splits,
    }

I run the lengthy text through it:

split_dict = split_on_sentences(text,4096)

and onto the “final” summary using the HF transformer:

from transformers import pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

I also tried the ‘facebook/bart-large-cnn’ model but found the results of the other model more to my liking. That led me to learn about ROUGE-1 and whether a high score is the most important aspect. From this small exploration I find ROUGE-1 of interest, but a higher score does not mean the model will work best for my needs. I am new to this stuff!

from tqdm import tqdm

final_summary = ''
for i, chunk in enumerate(tqdm(split_dict['splits'], desc='Summarizing')):
    summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False)
    final_summary += summary[0]['summary_text'] + " "
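One thing worth checking in the loop above: to my understanding, LangChain's text splitters measure `chunk_size` in characters by default, so `4096` is a character budget and does not by itself guarantee that every chunk stays under the model's 1024-token limit. A minimal sketch of packing sentences by token count instead is below. `pack_sentences` and `count_tokens` are hypothetical names introduced for illustration, not part of any library.

```python
from typing import Callable, List


def pack_sentences(
    sentences: List[str],
    count_tokens: Callable[[str], int],  # hypothetical: returns token count of a string
    max_tokens: int,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        # Note: a single sentence longer than max_tokens still gets its own
        # (oversized) chunk; it would need truncating or further splitting.
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a Hugging Face tokenizer loaded for the same model, `count_tokens` could be something like `lambda s: len(tokenizer.encode(s))`. Summing per-sentence counts is only an approximation of the token count of the joined chunk, so it is safer to set `max_tokens` somewhat below the model limit.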

I write this final summary into a file. Then, because I pay $20/mo for a ChatGPT4 subscription, I paste that text in and get an expanded, 350-word summary.

The last step (using ChatGPT4 manually) is a bit tedious, but I’m already paying for it, so I might as well use it.

I like using the targeted transformer models for summarization because they are fairly fast. This is probably doing something similar to what refine does, but it seems simpler to me to just deal with the text this way.
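For comparison, the Refine strategy differs from summarizing chunks independently: it carries a running summary forward and updates it with each new chunk. A simplified sketch follows, assuming a hypothetical `summarize` callable (text in, shorter text out). LLM-based refine chains typically use a dedicated prompt for the update step; here the running summary is simply prepended to the next chunk.

```python
from typing import Callable, Iterable


def refine_summarize(
    chunks: Iterable[str],
    summarize: Callable[[str], str],  # hypothetical: text in, shorter text out
) -> str:
    """Build one summary by folding each chunk into the summary so far."""
    running = ""
    for chunk in chunks:
        # Combine the summary so far with the next chunk and summarize again.
        # The combined text must still fit the model's input limit, so the
        # chunks need to leave room for the running summary.
        running = summarize(f"{running}\n\n{chunk}".strip())
    return running
```

Because each step depends on the previous one, this cannot be parallelized the way independent per-chunk summaries can, but later chunks are summarized with knowledge of what came before.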

Again, Thank you VERY MUCH for your reply.

I wouldn’t recommend setting the overlap to 0, as most sentences depend on context from the previous or next line. Overlapping to some extent makes sure we are not missing the context of a given line in the paragraph.
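To make the overlap idea concrete, here is a small sketch that chunks a list of sentences so that consecutive chunks share a few sentences. `chunk_with_overlap` is a hypothetical helper written for illustration; with the LangChain splitter used earlier in the thread, the corresponding knob is the `chunk_overlap` argument.

```python
from typing import List


def chunk_with_overlap(sentences: List[str], chunk_size: int, overlap: int) -> List[str]:
    """Group sentences into chunks of chunk_size, repeating `overlap` sentences between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(sentences):
            break
    return chunks
```

For example, with five sentences, `chunk_size=3` and `overlap=1`, the third sentence appears at the end of the first chunk and again at the start of the second, so neither chunk loses its neighbouring context.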
If you use GPT3 or 4, it can do all of this easily with proper prompts, but I would recommend achieving it with open-source models on Hugging Face, so that you have a non-commercial solution.
That way you can understand the shortcomings and find a better open-source model, or fine-tune one yourself if you have a good dataset.