Getting an error attempting the 2nd lab on RAG pipelines locally

Help Needed: HTMLToDocument Converter Error in Customized RAG Lab on M3 Mac

Environment

  • Hardware: M3 Mac
  • Running the “Build Customized RAG” lesson locally in a Jupyter notebook
  • All requirements from the provided requirements.txt are installed

Issue

I’m encountering an error when initializing the HTMLToDocument converter, even though the trafilatura dependency is already installed locally. I also followed the instruction in the error message and additionally ran pip install lxml[html_clean], which installed lxml_html_clean 0.2.2, but the error persists.
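In case it helps, here is the quick diagnostic I run from inside the notebook (just a sketch; the package names are taken from the traceback below) to confirm the kernel is using the same virtualenv I installed everything into:

import importlib
import sys

# Which interpreter is this kernel actually running? If it is not the
# pyenv "haystack" virtualenv, my pip installs went somewhere else.
print(sys.executable)

# Try each link in the import chain that the traceback walks through.
for name in ("trafilatura", "justext", "lxml", "lxml_html_clean"):
    try:
        importlib.import_module(name)
        print(f"{name}: importable")
    except ImportError as err:
        print(f"{name}: FAILED ({err})")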

Observed Behavior

  1. The component lazily imports trafilatura and, if the import fails, defers the error along with a hint to run pip install trafilatura.
  2. Despite trafilatura already being installed, the deferred import still fails inside the component source code.
  3. For some reason, the re-raised error reports 'None' as the name it failed to import (see the sketch after this list).
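For reference, the deferred-import pattern from the traceback below looks roughly like this (a minimal sketch; I’m assuming the LazyImport helper is importable from haystack.lazy_imports, which isn’t shown in the traceback itself):

from haystack.lazy_imports import LazyImport

# The ImportError raised inside the with-block is swallowed and stored;
# nothing happens until .check() is called in HTMLToDocument.__init__().
with LazyImport("Run 'pip install trafilatura'") as trafilatura_import:
    from trafilatura import extract  # fails here if justext/lxml_html_clean is broken

trafilatura_import.check()  # re-raises the stored error with the wrapped message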

Error Details

{
"name": "ImportError",
"message": "Failed to import 'None'. Run 'pip install trafilatura'. Original error: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.",
"stack": "---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/components/converters/html.py:17
     16 with LazyImport("Run 'pip install trafilatura'") as trafilatura_import:
---> 17     from trafilatura import extract
     20 @component
     21 class HTMLToDocument:

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/__init__.py:18
     17 from .baseline import baseline, html2txt
---> 18 from .core import bare_extraction, extract, process_record
     19 from .downloads import fetch_response, fetch_url

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/core.py:16
     15 from .deduplication import content_fingerprint, duplicate_test
---> 16 from .external import compare_extraction
     17 from .htmlprocessing import build_html_output, convert_tags, prune_unwanted_nodes, tree_cleaning

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/external.py:12
     11 # third-party
---> 12 from justext.core import (ParagraphMaker, classify_paragraphs,
     13                           revise_paragraph_classification)
     14 from justext.utils import get_stoplist  # , get_stoplists

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/justext/__init__.py:12
     11 from .utils import get_stoplists, get_stoplist
---> 12 from .core import justext
     15 __version__ = "3.0.1"

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/justext/core.py:21
     19 from backports.functools_lru_cache import lru_cache
---> 21 from lxml.html.clean import Cleaner
     22 from xml.sax.handler import ContentHandler

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/lxml/html/clean.py:18
     17 except ImportError:
---> 18     raise ImportError(
     19         "lxml.html.clean module is now a separate project lxml_html_clean.\n"
     20         "Install lxml[html_clean] or lxml_html_clean directly."
     21     ) from None

ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.

The above exception was the direct cause of the following exception:

ImportError                               Traceback (most recent call last)
Cell In[28], line 4
      1 document_store = InMemoryDocumentStore()
      3 fetcher = LinkContentFetcher()
----> 4 converter = HTMLToDocument()
      5 embedder = CohereDocumentEmbedder(model="embed-english-v3.0", api_base_url=os.getenv("CO_API_URL"))
      6 writer = DocumentWriter(document_store=document_store)

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/core/component/component.py:177, in ComponentMeta.__call__(cls, *args, **kwargs)
    175 pre_init_hook = _COMPONENT_PRE_INIT_CALLBACK.get()
    176 if pre_init_hook is None:
--> 177     instance = super().__call__(*args, **kwargs)
    178 else:
    179     named_positional_args = ComponentMeta.positional_to_kwargs(cls, args)

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/components/converters/html.py:54, in HTMLToDocument.__init__(self, extractor_type, try_others, extraction_kwargs)
     37 def __init__(
     38     self,
     39     extractor_type: Optional[str] = None,
     40     try_others: Optional[bool] = None,
     41     extraction_kwargs: Optional[Dict[str, Any]] = None,
     42 ):
     43     """
     44     Create an HTMLToDocument component.
     45
   (...)
     52     the Trafilatura documentation.
     53     """
---> 54     trafilatura_import.check()
     55     if extractor_type is not None:
     56         warnings.warn(
     57             "The extractor_type parameter is ignored and will be removed in Haystack 2.4.0. "
     58             "To customize the extraction, use the extraction_kwargs parameter.",
     59             DeprecationWarning,
     60         )

File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/lazy_imports/try_import.py:107, in _DeferredImportExceptionContextManager.check(self)
    105 if self._deferred is not None:
    106     exc_value, message = self._deferred
--> 107     raise ImportError(message) from exc_value

ImportError: Failed to import 'None'. Run 'pip install trafilatura'. Original error: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly."
}

Questions

  1. Has anyone encountered a similar issue with the HTMLToDocument converter, especially on M3 Macs?
  2. Are there any known conflicts or specific versions of trafilatura that work best with this lab on Apple Silicon?
  3. Could this be related to any M3 Mac-specific dependencies or configurations?
  4. I’m also using a virtualenv with Python 3.10.13; could a mismatch with the runtime environment of the lab notebook be part of the problem?
  5. Any suggestions on how to troubleshoot or resolve this error on my system?

Thank you in advance for any help or insights you can provide!

For now I may have gotten around this using a custom component for the converter built on BeautifulSoup (rough sketch below). I would prefer to use what was provided in the lesson, but honestly I see it as less likely that I’ll be sourcing from streamed HTML in the immediate practical applications I have. I’m still interested if anyone else experiences this problem; I never figured out whether it was just something I got wrong when setting up my environment differently from the VM this notebook runs on, or whether there was a simpler explanation in my dependency setup.
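Roughly what that workaround looks like, in case it’s useful to anyone (a minimal sketch, not the lesson’s code; the component name and the plain get_text() extraction are my own choices, and it assumes Haystack 2.x plus beautifulsoup4):

from typing import List

from bs4 import BeautifulSoup
from haystack import Document, component
from haystack.dataclasses import ByteStream


@component
class BSHTMLToDocument:
    """Convert HTML ByteStreams (e.g. from LinkContentFetcher) to Documents with BeautifulSoup."""

    @component.output_types(documents=List[Document])
    def run(self, sources: List[ByteStream]):
        documents = []
        for stream in sources:
            # Parse the raw bytes and keep only the visible text.
            soup = BeautifulSoup(stream.data, "html.parser")
            text = soup.get_text(separator="\n", strip=True)
            documents.append(Document(content=text, meta=dict(stream.meta)))
        return {"documents": documents}

In the pipeline I then connect fetcher.streams to this component’s sources in place of the original converter.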

Hey hey, I found a similar issue on the trafilatura repository, issue 693 to be precise. For some reason I can’t directly link it, though. It looks like an issue with your local environment; I’d suggest recreating it from scratch and seeing if that solves the problem.
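If you want to compare before and after recreating it, something like this (just a sketch using the standard library; the package list is my guess based on your traceback) prints what the kernel’s environment actually has:

from importlib import metadata

# Distribution names as published on PyPI; adjust if your requirements.txt differs.
for pkg in ("haystack-ai", "trafilatura", "jusText", "lxml", "lxml_html_clean"):
    try:
        print(pkg, metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg, "not installed in this environment")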

I tried reproducing it locally, but I don’t see the same issue on my M1 Pro; I wouldn’t expect different behaviour on an M3, though.


Thank you for the reply! I’ll read through that issue. I suspected at some point that I had set something up in a way that wasn’t going to align with what the cloud-based notebook was using. Having worked around it with BeautifulSoup for now, I may deprioritize investigating what it was, but it’s really good to have that suspicion supported, if not confirmed, and to know there are other instances of it. Web searches weren’t turning up much for this error, so looking at the issues in trafilatura’s repo was a great idea.
