UNIVERSITY PARK, Pa. – A search engine that uses artificial intelligence (AI) to “read” millions of documents online could help privacy researchers find those related to online privacy. The researchers who designed the search engine suggest that it could be an important tool for researchers trying to find ways to make the internet safer.
In a study, the researchers said the search engine, which they dubbed PrivaSeer, uses a type of AI called natural language processing, or NLP, to identify online privacy documents, such as privacy policies, terms of use, cookie policies, privacy bills and laws, regulatory guidelines and other related texts on the web.
Rather than trying to search for privacy documents themselves, researchers could enter their queries into the search engine to effectively identify and collect the correct documentation.
Ultimately, however, the search engine could help researchers better understand online privacy in general and examine trends in online privacy over time, which could one day lead to an internet that users could browse more safely and securely, according to Shomir Wilson, assistant professor of information sciences and technology at Penn State and an Institute for Computational and Data Sciences affiliate.
“This can be a resource for both natural language processing and privacy researchers interested in this area of text,” Wilson said. “Given large volumes of text like this, we can find ways to automatically identify and tag certain data practices that might be of interest to people, which in turn helps create tools to help users understand online privacy.”
NLP combines linguistics, computing and AI to program computers to process and analyze large amounts of text. In this case, the researchers used NLP to collect privacy policy documents from the web, according to Mukund Srinath, a doctoral student in information science and technology and the study’s first author.
“The NLP approach can differentiate between privacy policy documents and non-privacy policy documents depending on certain words that appear in the text,” Srinath said. “Intuitively, you may think that privacy policies may contain certain words that non-privacy policies don’t, such as data protection and confidentiality, which are just a few of the common words. With the NLP approach, one could say that the algorithm learns to recognize the difference between these two different types of documents.”
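The word-based intuition Srinath describes can be illustrated with a small sketch. This is not PrivaSeer's actual classifier, which the article does not detail; it is a minimal naive Bayes model over word counts, and the class name, the training sentences and the labels "privacy" and "other" are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Tiny multinomial naive Bayes over word counts (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = {}              # label -> Counter of word frequencies
        self.doc_counts = Counter(labels)  # label -> number of training documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood of each word, with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training examples, invented for this sketch
TRAIN = [
    ("We collect personal data and protect your confidentiality", "privacy"),
    ("This policy explains data protection and how we share information", "privacy"),
    ("Our team won the championship game last night", "other"),
    ("Try this recipe for fresh tomato soup", "other"),
]
clf = NaiveBayesTextClassifier().fit(
    [text for text, _ in TRAIN], [label for _, label in TRAIN]
)
print(clf.predict("Your data protection and confidentiality matter"))
```

Words such as "data," "protection" and "confidentiality" appear only in the privacy examples, so a new text that contains them scores higher under that label. A production system would train on far more documents and would likely use richer features than raw word counts.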
He added that finding and classifying the privacy literature without machine learning would be time-consuming and difficult, if not impossible.
More knowledge of privacy information is needed because this type of documentation is largely ignored by regular users, according to Wilson.
“Most websites present you with information about their data practices and then you’re supposed to consent by browsing and reading all of that information,” Wilson said. “But nobody really does it because it’s inconvenient and doesn’t match the way people use the internet. People generally don’t have the legal knowledge.”
The privacy policies were collected by the PrivaSeer search engine during two separate web crawls. Web crawling refers to the systematic browsing of the Internet on a large scale, as performed by software. The first crawl took place in July 2019. The second crawl took place in February 2020.
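As a rough illustration of what a web crawl involves, the sketch below visits pages breadth-first and follows the links it finds. It is not the PrivaSeer crawler; the function names are hypothetical, and the `fetch` argument is an assumed callable that returns a page's HTML, or `None` when the page is unavailable, so the example can run against an in-memory site without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url; returns a dict of URL -> HTML."""
    queue = deque([seed_url])
    seen = {seed_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A made-up three-page site standing in for the web
SITE = {
    "https://example.com/": '<a href="/privacy">Privacy</a> <a href="/about#team">About</a>',
    "https://example.com/privacy": '<a href="/">Home</a>',
    "https://example.com/about": "<p>About us</p>",
}
pages = crawl("https://example.com/", SITE.get)
print(sorted(pages))
```

A crawler run against real websites would also need to honor robots.txt rules, limit its request rate and handle network errors, none of which is shown here.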
The PrivaSeer database now consists of around 1.4 million English website privacy policies.
“One of the great things about our database is that we have the greatest snapshot of privacy online,” Wilson said.
Soundarya Nurani Sundareswara, a former Penn State graduate student in information science and technology who is now a software engineer at Apple, and C. Lee Giles, David Reese Professor in Penn State's College of Information Sciences and Technology, worked with Wilson and Srinath on the project.
The team published its findings in the proceedings of the International Conference on Web Engineering.