How badly do anti-crawling measures degrade the quality of training data?

While recently optimizing a multimodal crawler system, we found that 23% of the product-data samples used to train our e-commerce recommendation models suffered from "geographic bias pollution": for the same product page, the page structure, price, and recommendation slots seen from a US IP differed from the version captured through a German IP by 41% on key features.
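For what it's worth, the 41% figure comes from a per-field comparison of two regional snapshots. Below is a minimal sketch of that kind of check, assuming the snapshots have already been parsed into flat feature dicts; the field names are illustrative, not our real schema.

```python
# Minimal sketch: per-field divergence between two regional snapshots of the
# same product page. Snapshot format (flat dicts of extracted features) and
# field names are illustrative assumptions.

def field_divergence(snapshot_a: dict, snapshot_b: dict) -> float:
    """Fraction of shared fields whose values differ between two snapshots."""
    shared = set(snapshot_a) & set(snapshot_b)
    if not shared:
        return 1.0  # nothing comparable at all -> treat as fully divergent
    differing = sum(1 for k in shared if snapshot_a[k] != snapshot_b[k])
    return differing / len(shared)


us_view = {"price": 19.99, "review_count": 1240, "rec_slot_1": "B07XYZ", "layout": "grid"}
de_view = {"price": 24.99, "review_count": 310,  "rec_slot_1": "B08ABC", "layout": "list"}

print(f"divergence: {field_divergence(us_view, de_view):.0%}")
```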

This leads to two key questions:

Data authenticity crisis: when the anti-crawling system serves fabricated data (e.g., randomized prices, hidden reviews) to certain IPs, are we effectively training the model on junk data?

Distribution bias blind spot: how do the differences in page content served to IPs in different regions affect the performance of models deployed across regions?

Current anti-crawling mechanisms pose three challenges for AI developers:
▸ Feature distortion: e-commerce sites hide 17% of review sentiment labels from data-center IPs
▸ Timing interference: news sites serve article versions to unconventional IPs that lag 8-16 hours behind
▸ Environment dependence: switching VPN exit IPs across 5 countries, we found mobile page structures differing by as much as 68% (measured roughly as sketched below)
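To make the third point concrete, here is a minimal sketch of how that kind of structural difference can be quantified: fetch the same URL through exit nodes in different countries and compare the sets of DOM tag paths with Jaccard similarity. The proxy endpoints are placeholders, not our production setup; it needs `requests` and `beautifulsoup4`.

```python
# Sketch: compare page structure for the same URL fetched via different regions.
import requests
from bs4 import BeautifulSoup

PROXIES = {
    "US": "http://us.exit.example:8080",   # placeholder proxy endpoints
    "DE": "http://de.exit.example:8080",
}

def tag_paths(html: str) -> set:
    """Set of root-to-node tag paths, a crude fingerprint of page structure."""
    soup = BeautifulSoup(html, "html.parser")
    paths = set()
    for node in soup.find_all(True):
        ancestry = "/".join(p.name for p in reversed(list(node.parents)) if p.name)
        paths.add(ancestry + "/" + node.name)
    return paths

def structure_similarity(url: str, region_a: str, region_b: str) -> float:
    pages = {}
    for region in (region_a, region_b):
        proxy = {"http": PROXIES[region], "https": PROXIES[region]}
        resp = requests.get(url, proxies=proxy, timeout=15,
                            headers={"User-Agent": "Mozilla/5.0"})
        pages[region] = tag_paths(resp.text)
    a, b = pages[region_a], pages[region_b]
    return len(a & b) / len(a | b)  # Jaccard similarity of structural paths

# Example: print(structure_similarity("https://example.com/item/123", "US", "DE"))
```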

We have tried several approaches, each with limitations:

Rotating proxies with header rotation (roughly the shape sketched below): reduces the ban rate but cannot eliminate environment differences (error rate still >29%)

Distributed crawling: costs roughly tripled, and we ran into ASN blacklisting

Reinforcement-learning behavior simulation: only reached 0.81 human similarity on the MouseTrack dataset
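For reference, the proxy/header-rotation attempt was roughly the following shape; the proxy endpoints and User-Agent pool are placeholders, and only `requests` is needed.

```python
# Sketch of proxy + header rotation. Lowers the ban rate, but each proxy still
# presents its own network environment (ASN, geo, latency), so content
# differences persist.
import itertools
import random
import requests

PROXY_POOL = [
    "http://proxy-1.example:3128",  # placeholder endpoints
    "http://proxy-2.example:3128",
    "http://proxy-3.example:3128",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/124.0",
]

proxy_cycle = itertools.cycle(PROXY_POOL)

def fetch(url: str) -> requests.Response:
    """Fetch a URL through the next proxy in the pool with a randomized UA."""
    proxy = next(proxy_cycle)
    headers = {"User-Agent": random.choice(USER_AGENTS),
               "Accept-Language": "en-US,en;q=0.9"}
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy}, timeout=15)
```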

So I'd like to ask: when you collect training data, do you test content consistency across different IP environments? For scenarios that need data across multiple geographic dimensions, how do you verify that the data has not been contaminated by anti-crawling systems? And if anyone has a technical approach that reliably gets past advanced anti-crawling, please share it.
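For context, the kind of consistency check I have in mind is something like a two-sample Kolmogorov-Smirnov test on a numeric field (prices here) collected for the same product set from two IP environments; a very small p-value would flag a distribution shift worth investigating. The arrays below are toy data and `scipy` is required.

```python
# Sketch: detect a distribution shift between two scraping environments.
from scipy.stats import ks_2samp

prices_residential = [19.99, 24.50, 9.99, 34.00, 12.49, 59.90, 7.25]
prices_datacenter  = [21.99, 27.00, 9.99, 39.50, 14.99, 64.90, 7.25]

stat, p_value = ks_2samp(prices_residential, prices_datacenter)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
# A low p-value (e.g. < 0.01) suggests the anti-crawling layer is serving a
# different price distribution to one of the environments.
```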