I'm trying to load some documents (Word files, PowerPoints and plain text) to train my custom LLM using LangChain.
When I run it I hit a weird error message telling me I don't have the "tokenizers" and "taggers" packages (folders).
I've read the docs, asked the LangChain chatbot, pip installed nltk, uninstalled it, pip installed nltk without dependencies, and added the data with nltk.download(), nltk.download("punkt"), nltk.download("all"), and so on. I also manually set the path with nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data'] and added all the folders, including the tokenizers and taggers folders from the GitHub repo. Everything. I also asked on the GitHub repo. Nothing, no success.
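For reference, this is the minimal snippet I use to check that the data is actually visible to NLTK (note the raw strings, so the backslashes in the Windows path aren't treated as escape sequences; as far as I understand, 'punkt' and 'averaged_perceptron_tagger' are the resources that live under the tokenizers and taggers folders):

import nltk

nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']
nltk.download('punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')
nltk.download('averaged_perceptron_tagger', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

# nltk.data.find() raises a LookupError if the resource is missing,
# so if both lines print a path the data itself is in place.
print(nltk.data.find('tokenizers/punkt'))
print(nltk.data.find('taggers/averaged_perceptron_tagger'))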
Here is the code of the file I'm trying to run:
from nltk.tokenize import sent_tokenize
from langchain.document_loaders import UnstructuredPowerPointLoader, TextLoader, UnstructuredWordDocumentLoader
from dotenv import load_dotenv, find_dotenv
import os
import openai
import sys
import nltk

# Raw strings so the backslashes in Windows paths are not read as escapes
nltk.data.path = [r'C:\Users\zaesa\AppData\Roaming\nltk_data']
nltk.download('punkt', download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')

sys.path.append('../..')
_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

folder_path_docx = r"DB\DB VARIADO\DOCS"
folder_path_txt = r"DB\BLOG-POSTS"
folder_path_pptx_1 = r"DB\PPT DAY JUNIO"
folder_path_pptx_2 = r"DB\DB VARIADO\PPTX"

loaded_content = []

# Load every .docx document
for file in os.listdir(folder_path_docx):
    if file.endswith(".docx"):
        file_path = os.path.join(folder_path_docx, file)
        loader = UnstructuredWordDocumentLoader(file_path)
        docx = loader.load()
        loaded_content.extend(docx)

# Load every .txt blog post
for file in os.listdir(folder_path_txt):
    if file.endswith(".txt"):
        file_path = os.path.join(folder_path_txt, file)
        loader = TextLoader(file_path, encoding='utf-8')
        text = loader.load()
        loaded_content.extend(text)

# Load the .pptx decks from both folders
for file in os.listdir(folder_path_pptx_1):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_1, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_1 = loader.load()
        loaded_content.extend(slides_1)

for file in os.listdir(folder_path_pptx_2):
    if file.endswith(".pptx"):
        file_path = os.path.join(folder_path_pptx_2, file)
        loader = UnstructuredPowerPointLoader(file_path)
        slides_2 = loader.load()
        loaded_content.extend(slides_2)

print(loaded_content[0].page_content)
print(nltk.data.path)

installed_packages = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data').packages()
print(installed_packages)

sent_tokenize("Hello. How are you? I'm well.")
When running the file I get:
[nltk_data] Error loading tokenizers: Package 'tokenizers' not found in index
[nltk_data] Error loading taggers: Package 'taggers' not found in index
(...the same two messages repeat in this pattern for the rest of the run...)
HERE SOME TEXT -
['C:\Users\zaesa\AppData\Roaming\nltk_data']
dict_values([, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ])
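In case someone wants to compare on their side, the package ids can be printed directly instead of the raw Package objects (a small sketch using the public Downloader API; the download_dir is just my local path):

import nltk.downloader

# Each entry from packages() is an nltk.downloader.Package; its .id is the
# name you would pass to nltk.download() (e.g. 'punkt').
downloader = nltk.downloader.Downloader(
    download_dir=r'C:\Users\zaesa\AppData\Roaming\nltk_data')
print(sorted(pkg.id for pkg in downloader.packages()))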
I have a fresh install of nltk with no dependencies, and the version is the latest. The NLTK support team doesn't know what is wrong; it seems everything is fine on their side. So it has to be a bug or something coming from LangChain that I'm not able to see. Really appreciate any help. I need to make this work! Thank you.
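P.S. If it helps to narrow things down, I'd expect the same [nltk_data] messages to show up with nothing but a single loader call, independent of the rest of my script (a minimal sketch; the file name here is just a placeholder):

from langchain.document_loaders import UnstructuredPowerPointLoader

# One .pptx through unstructured should be enough to trigger the NLTK lookups.
loader = UnstructuredPowerPointLoader(r'DB\PPT DAY JUNIO\example.pptx')  # placeholder file name
slides = loader.load()
print(slides[0].page_content[:200])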