The big data market has grown sharply over the last few years and is estimated to reach $103 billion by 2027. Together with rapid advancements in artificial intelligence (AI) and machine learning (ML), this growth is further accelerating demand for publicly available web data.
According to the web intelligence collection platform Oxylabs, 2024 might bring critical developments in the data scraping industry. An influx of new scraping-related legal cases could answer some of the questions previously left in a legal “gray zone,” changing how organizations and individuals collect data online.
Data Ethics Is Undergoing a Renaissance
“The growth in generative AI capabilities exceeded the expectations of most experts, bringing the question of ethical data collection back to the front of the discussions in courtrooms and political institutions,” says Julius Černiauskas, the CEO of Oxylabs. “For quite some time, it seemed that, at least in Europe, GDPR solved the most pressing privacy protection issues; however, an unprecedented scale of data collection needed for the AI training resumed the debates around privacy, the fair use of data, and data ownership.”
Černiauskas foresees that more scraping industry players will join the Ethical Web Data Collection Initiative, since the need to promote common industry standards and ethical data-gathering practices is as important as ever. “Showing that the web intelligence industry adheres to ethical data collection requirements and respects data fairness is the only way to raise public trust in the wake of ongoing AI lawsuits,” concludes Černiauskas.
Case Law Might Answer Some of the Pressing Industry Questions
According to Denas Grybauskas, the Head of Legal at Oxylabs, ongoing lawsuits targeting AI companies and a web intelligence provider that scraped social media platforms might bring significant changes in data collection practices and even impact further technological developments.
“From the ongoing legal cases, we can see that the most pressing industry questions might be split into two broad categories. The first one concerns intellectual property rights and other AI-related legal questions, for example, what is public domain, and how could we define fair (or unfair) use of publicly available data? Some well-known AI companies have faced hostility because they scraped massive amounts of public content that was created by millions of internet users worldwide. OpenAI’s recent decision to reveal its web crawler GPTBot and allow websites to opt out of being scraped is a measure that, in the near future, might become more common.”
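In practice, this kind of opt-out is usually expressed in a site's robots.txt file, which a well-behaved crawler checks before fetching a page. The sketch below is illustrative only and is not taken from Oxylabs or OpenAI: it uses Python's standard `urllib.robotparser` module, and it assumes a hypothetical robots.txt that blocks the `GPTBot` user agent while allowing everyone else. The `example.com` URL, the `ExampleCrawler` name, and the `is_allowed` helper are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot site-wide, allows all other crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL and crawler name, used only for illustration.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/articles/1"))
print(is_allowed(ROBOTS_TXT, "ExampleCrawler", "https://example.com/articles/1"))
```

With these assumed rules, the first check reports that GPTBot is not allowed and the second reports that the other crawler is. Compliance with robots.txt is voluntary, so a check like this shows a site's stated preference rather than a technical barrier.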
The impact such decisions will have on the further development of AI is still unknown, says Grybauskas. “Most AI systems today rely on ML technology, which needs a constant influx of data to train underlying algorithms and maintain quality outputs. Restrictions on public data collection might hinder AI innovation.”
According to Grybauskas, the second wave of emerging case law should revolve around privacy and personal data. “Collecting data from publicly available social media pages raised concerns about data privacy, especially that related to minors, as the recent class-action suit against a web scraping provider has shown. Today, consumers demand more accountability and transparency when it comes to handling their data. Despite GDPR being called the toughest privacy law in the world, EU policymakers are facing increased criticism in the media that argues they are failing to force Big Tech to comply with GDPR’s requirements.”
Grybauskas notes that policymakers are also making a growing effort to address these concerns. “In August, representatives of twelve regulatory bodies from different countries issued a joint statement on data scraping and privacy protection with detailed technical recommendations for social networks. In California, a new bill called the Delete Act was recently passed, targeting data brokers and establishing additional regulation for personal data collection and management.”
“I believe the question of privacy will remain the main focus area in the discussions around the legality of web scraping in 2024, and this is actually a positive development as it should bring more clarity for all industry players,” highlights Grybauskas.
Demand for Custom Solutions and Datasets Will Grow
As more businesses shift their focus toward data-driven decision-making to beat the competition and improve efficiency, the need for publicly available web data is expected to grow further over the next year, says Oxylabs’ Chief Commercial Officer Tomas Montvilas.
However, businesses will increasingly demand customized scraping solutions to meet their specific needs, says Montvilas. “Powered by AI and ML, anti-scraping measures are becoming more challenging not only for businesses collecting e-commerce data but also for cybersecurity companies that have to deal with professional threat actors blocking their threat intelligence efforts.
“Moreover, the volume of data itself and the variation of formats and languages in which the data is collected are posing a growing challenge for web intelligence-powered analytics. To ensure data accuracy and reliability, providers of web scraping solutions will have to rely more on AI and ML technologies to offer adaptive and maintenance-free data scrapers and parsers.”
According to Montvilas, to save time, internal resources, and costs, more businesses will switch from collecting data in-house to acquiring “ready-made” custom datasets that are cleaned, structured, and suitable for analysis. “The main trend for 2024 remains the same: to get actionable insights, mitigate business risks, and manage reputation, companies will have to employ competitive intelligence (CI), connecting their data collection and analysis capabilities into a single system. Web data-powered competitive intelligence is the biggest unused resource with real transformational impact for both private and public sector organizations.”