Help Needed: HTMLToDocument Converter Error in Customized RAG Lab on M3 Mac
Environment
- Hardware: M3 Mac
- Running the “Build Customized RAG” lesson locally in a Jupyter notebook
- All requirements from the provided requirements.txt are installed
Issue
I’m encountering an error when initializing the HTMLToDocument converter. The error occurs even though the trafilatura dependency is already installed locally. I also followed the instruction in the error message and ran pip install lxml[html_clean] (version 0.2.2).
Observed Behavior
- The component source code wraps the trafilatura import in a lazy-import context manager and tells me to pip install trafilatura when that import fails.
- trafilatura is already installed, yet the component still reports that it needs to be installed.
- For some reason, the error message names the failed import as ‘None’ in the context manager.
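A minimal sketch of how I think the ‘None’ could arise, assuming the lazy-import helper builds its message from the name attribute of the ImportError it catches. DeferredImport below is a hypothetical stand-in written for illustration, not the actual lazy_imports code:

```python
class DeferredImport:
    """Hypothetical stand-in for a lazy-import context manager (illustration only)."""

    def __init__(self, hint):
        self.hint = hint
        self.deferred = None  # becomes (exception, message) once an import fails

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if isinstance(exc_value, ImportError):
            # ImportError.name is filled in when Python cannot find a module.
            # An ImportError raised by hand with only a message has name=None.
            message = (
                f"Failed to import '{exc_value.name}'. {self.hint}. "
                f"Original error: {exc_value}"
            )
            self.deferred = (exc_value, message)
            return True  # swallow the error until check() is called
        return False

    def check(self):
        if self.deferred is not None:
            exc_value, message = self.deferred
            raise ImportError(message) from exc_value


with DeferredImport("Run 'pip install trafilatura'") as demo_import:
    # Simulates the manual raise in lxml/html/clean.py from the traceback.
    raise ImportError(
        "lxml.html.clean module is now a separate project lxml_html_clean."
    )

try:
    demo_import.check()
except ImportError as err:
    deferred_message = str(err)
    print(deferred_message)
```

If that assumption holds, the ‘None’ and the ‘pip install trafilatura’ hint are both side effects of the wrapper, and the part of the message that matters is the original lxml error.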
Error Details
{
"name": "ImportError",
"message": "Failed to import 'None'. Run 'pip install trafilatura'. Original error: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.",
"stack": "---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/components/converters/html.py:17
16 with LazyImport("Run 'pip install trafilatura'") as trafilatura_import:
---> 17 from trafilatura import extract
20 @component
21 class HTMLToDocument:
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/__init__.py:18
17 from .baseline import baseline, html2txt
---> 18 from .core import bare_extraction, extract, process_record
19 from .downloads import fetch_response, fetch_url
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/core.py:16
15 from .deduplication import content_fingerprint, duplicate_test
---> 16 from .external import compare_extraction
17 from .htmlprocessing import build_html_output, convert_tags, prune_unwanted_nodes, tree_cleaning
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/trafilatura/external.py:12
11 # third-party
---> 12 from justext.core import (ParagraphMaker, classify_paragraphs,
13 revise_paragraph_classification)
14 from justext.utils import get_stoplist # , get_stoplists
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/justext/__init__.py:12
11 from .utils import get_stoplists, get_stoplist
---> 12 from .core import justext
15 __version__ = "3.0.1"
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/justext/core.py:21
19 from backports.functools_lru_cache import lru_cache
---> 21 from lxml.html.clean import Cleaner
22 from xml.sax.handler import ContentHandler
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/lxml/html/clean.py:18
17 except ImportError:
---> 18 raise ImportError(
19 "lxml.html.clean module is now a separate project lxml_html_clean.\n"
20 "Install lxml[html_clean] or lxml_html_clean directly."
21 ) from None
ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly.
The above exception was the direct cause of the following exception:
ImportError Traceback (most recent call last)
Cell In[28], line 4
1 document_store = InMemoryDocumentStore()
3 fetcher = LinkContentFetcher()
----> 4 converter = HTMLToDocument()
5 embedder = CohereDocumentEmbedder(model="embed-english-v3.0", api_base_url=os.getenv("CO_API_URL"))
6 writer = DocumentWriter(document_store=document_store)
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/core/component/component.py:177, in ComponentMeta.__call__(cls, *args, **kwargs)
175 pre_init_hook = _COMPONENT_PRE_INIT_CALLBACK.get()
176 if pre_init_hook is None:
--> 177 instance = super().__call__(*args, **kwargs)
178 else:
179 named_positional_args = ComponentMeta.positional_to_kwargs(cls, args)
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/haystack/components/converters/html.py:54, in HTMLToDocument.__init__(self, extractor_type, try_others, extraction_kwargs)
37 def __init__(
38 self,
39 extractor_type: Optional[str] = None,
40 try_others: Optional[bool] = None,
41 extraction_kwargs: Optional[Dict[str, Any]] = None,
42 ):
43 """
44 Create an HTMLToDocument component.
45
(…)
52 the Trafilatura documentation.
53 """
---> 54 trafilatura_import.check()
55 if extractor_type is not None:
56 warnings.warn(
57 "The `extractor_type` parameter is ignored and will be removed in Haystack 2.4.0. "
58 "To customize the extraction, use the `extraction_kwargs` parameter.",
59 DeprecationWarning,
60 )
File ~/.pyenv/versions/3.10.13/envs/haystack/lib/python3.10/site-packages/lazy_imports/try_import.py:107, in _DeferredImportExceptionContextManager.check(self)
105 if self._deferred is not None:
106 exc_value, message = self._deferred
--> 107 raise ImportError(message) from exc_value
ImportError: Failed to import 'None'. Run 'pip install trafilatura'. Original error: lxml.html.clean module is now a separate project lxml_html_clean.
Install lxml[html_clean] or lxml_html_clean directly."
}
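To narrow down which link in that import chain actually breaks, a small check like the following could be run in the same notebook kernel. It is only a sketch: first_failing_import is a helper name made up for this post, and the module list is simply read off the traceback from the innermost import outward.

```python
import importlib


def first_failing_import(module_names):
    """Import each module in order; return (name, error) for the first failure, else None."""
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as exc:  # broad on purpose: any import-time failure is of interest
            return name, exc
    return None


# Innermost dependency first, following the traceback from bottom to top.
chain = ["lxml_html_clean", "lxml.html.clean", "justext", "trafilatura"]

result = first_failing_import(chain)
if result is None:
    print("All modules in the chain imported cleanly in this interpreter.")
else:
    failed_name, failed_exc = result
    print(f"First failure: {failed_name} -> {failed_exc!r}")
```

If lxml_html_clean itself fails here, the package is presumably not installed in the interpreter the kernel is using, regardless of what pip reported in a terminal.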
Questions
- Has anyone encountered a similar issue with the HTMLToDocument converter, especially on M3 Macs?
- Are there any known conflicts or specific versions of trafilatura that work best with this lab on Apple Silicon?
- Could this be related to any M3 Mac-specific dependencies or configurations?
- I’m also using a virtualenv with Python 3.10.13. Could it be out of sync with the runtime environment the lab notebook actually uses?
- Any suggestions on how to troubleshoot or resolve this error on my system?
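For the virtualenv question, this is the kind of check I have in mind, run from a notebook cell. It is a sketch under assumptions: environment_report is a made-up helper, and the distribution names in the list (in particular haystack-ai for Haystack) are my guess at what pip calls these packages.

```python
import sys
from importlib import metadata


def environment_report(packages):
    """Collect the interpreter path and installed versions (None if not installed)."""
    report = {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "python": sys.version.split()[0],
        "packages": {},
    }
    for pkg in packages:
        try:
            report["packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report["packages"][pkg] = None
    return report


# Distribution names are assumptions; adjust them to match `pip list`.
report = environment_report(
    ["haystack-ai", "trafilatura", "justext", "lxml", "lxml_html_clean"]
)
print("interpreter:", report["executable"])
print("python:", report["python"])
for pkg, version in report["packages"].items():
    print(f"{pkg}: {version if version is not None else 'not installed'}")
```

If the interpreter path printed here is not under ~/.pyenv/versions/3.10.13/envs/haystack, the kernel is running a different environment from the one I installed into.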
Thank you in advance for any help or insights you can provide!