Do DocumentLoaders support data cleaning?

I saw the PDF DocumentLoader example and noticed a number of word breaks in the 500 characters it displayed. Are there any built-in features to help correct this kind of ingestion error, or do we have to write our own post-processing to clean the data? I know from experience how tricky PDF files can be to work with.
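
To illustrate the kind of post-processing I mean, here is a minimal sketch in plain Python. It assumes the breaks show up as a hyphen followed by a line break, which may not match what the loader actually produces, and `fix_word_breaks` is a hypothetical helper of my own, not part of any library.

```python
import re


def fix_word_breaks(text: str) -> str:
    """Rejoin words split across lines and tidy whitespace (illustrative only)."""
    # Assumption: a hyphen followed by a line break between two word
    # characters marks a broken word, e.g. "docu-\nment" -> "document".
    # This can wrongly merge genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn single newlines inside a paragraph into spaces, but keep
    # blank lines (double newlines) as paragraph separators.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Something like this would be applied to each loaded document's text after loading. It does not handle words split by a stray space in the middle, which would probably need a dictionary-based check.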