Thank you very much for the reply. I ended up finding a method that works “ok” for me:
- I used the NLTK sentence splitting from LangChain. I experimented with the character and sentence splitters and found the sentence splitter to make the most sense to me. My thinking is that by breaking chunks on sentence boundaries I don’t need to worry about overlapping characters, and all the content still gets “gobbled up” by the model. So I have this:
import nltk
nltk.download('punkt')
from langchain.text_splitter import NLTKTextSplitter

def split_on_sentences(text, split_size):
    # Split on sentence boundaries, packing whole sentences into chunks
    # of at most split_size characters, with no overlap between chunks
    nltk_splitter = NLTKTextSplitter(separator=' ', chunk_size=split_size, chunk_overlap=0)
    splits = nltk_splitter.split_text(text)
    split_sizes = [len(split) for split in splits]
    average_split_size = round(sum(split_sizes) / len(split_sizes))
    return {
        'average_split_size': average_split_size,
        'num_splits': len(splits),
        'splits': splits,
    }
I run the lengthy text through it:
split_dict = split_on_sentences(text, 4096)
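The returned dict makes it easy to sanity-check the chunking before spending time on any summarization. Something like this (the numbers in the comments are hypothetical; yours will depend on the text):

print(split_dict['num_splits'])          # e.g. 12 chunks
print(split_dict['average_split_size'])  # e.g. ~3900 characters per chunk
print(split_dict['splits'][0][:200])     # peek at the start of the first chunk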
Then I move on to the “final” summary using a Hugging Face transformer:
from transformers import pipeline
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
I also tried the ‘facebook/bart-large-cnn’ model but found the results from the other model more to my liking. That small exploration is how I learned about ROUGE-1 and the question of whether a high score is the most important aspect: I find ROUGE-1 interesting, but the model with the highest score will not necessarily work best for my needs. I am new to this stuff!
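For anyone else new to this, here is a minimal sketch of computing ROUGE-1 with the rouge_score package (this is not part of my pipeline above, and the reference/candidate strings are just placeholders):

from rouge_score import rouge_scorer

# ROUGE-1 measures unigram (single-word) overlap between a candidate summary
# and a human-written reference summary
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
reference = "The quick brown fox jumps over the lazy dog."  # placeholder reference
candidate = "A quick brown fox jumped over a lazy dog."     # placeholder model output
print(scorer.score(reference, candidate)['rouge1'])         # precision, recall, F1

Anyway, back to my pipeline: I summarize each chunk in a loop: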
from tqdm import tqdm

final_summary = ''
for chunk in tqdm(split_dict['splits'], desc='Summarizing'):
    # Summarize each chunk independently, then stitch the pieces together
    summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False)
    final_summary += summary[0]['summary_text'] + " "
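One caveat I should mention (my understanding, not something I verified exhaustively): distilbart’s encoder takes roughly 1024 tokens, so a 4096-character chunk usually fits, but anything longer will be cut off. Passing truncation explicitly makes that clipping deliberate rather than an error:

# truncation=True clips over-long chunks at the model's max input length
# instead of raising an error (assumes a reasonably recent transformers version)
summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False, truncation=True)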
I write this final summary into a file, and then, since I already pay $20/mo for a ChatGPT-4 subscription, I paste that text in manually and get back an expanded, roughly 350-word summary. That last manual step is a bit tedious, but I might as well use what I am paying for.
I like using the targeted transformer models for summarization because they are fairly fast, and the chunk-by-chunk loop is probably doing something similar to what refine does, but it just seems simpler to me to deal with the text this way.
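For comparison, here is my rough approximation of a refine-style pass (my own sketch, not LangChain’s actual implementation): instead of summarizing chunks independently, the running summary is carried forward and re-summarized together with each new chunk.

def refine_style_summary(chunks, summarizer):
    # Rough approximation of refine: fold the running summary into the
    # next summarization call so later chunks can build on earlier ones
    running = ''
    for chunk in chunks:
        combined = (running + " " + chunk).strip()
        result = summarizer(combined, max_length=150, min_length=30,
                            do_sample=False, truncation=True)
        running = result[0]['summary_text']
    return running

# e.g. print(refine_style_summary(split_dict['splits'], summarizer))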
Again, Thank you VERY MUCH for your reply.