UNIVERSITY PARK, Pa. – A search engine that uses artificial intelligence (AI) to “read” millions of documents online could help researchers find the ones related to online privacy. The researchers who designed the search engine suggest that it could be an important tool for those trying to find ways to make the internet safer.
Rather than searching for privacy documents manually, researchers could enter their queries into the search engine to identify and collect the relevant documents.
Ultimately, however, the search engine could help researchers better understand online privacy in general and examine trends in online privacy over time, which could one day lead to an internet that users could browse more safely and securely, according to Shomir Wilson, assistant professor of information science and technology at Penn State and an Institute of Computer Science and Data affiliate.
“This can be a resource for both natural language processing and privacy researchers interested in this area of text,” Wilson said. “Given large volumes of text like this, we can find ways to automatically identify and tag certain data practices that might be of interest to people, which in turn helps create tools to help users understand online privacy.”
He added that finding and classifying the privacy literature without machine learning would be time consuming and difficult, if not impossible.
More knowledge of privacy information is needed because this type of documentation is largely ignored by regular users, according to Wilson.
“Most websites present you with information about their data practices and then you’re supposed to consent by browsing and reading all of that information,” Wilson said. “But nobody really does it because it’s inconvenient and doesn’t match the way people use the internet. People generally don’t have the legal knowledge.”
The privacy policies were collected by the PrivaSeer search engine during two separate web crawls. Web crawling refers to the systematic browsing of the Internet on a large scale, as performed by software. The first crawl took place in July 2019. The second crawl took place in February 2020.
The PrivaSeer database now consists of around 1.4 million English website privacy policies.
“One of the great things about our database is that we have the greatest snapshot of privacy online,” Wilson said.
Soundarya Nurani Sundareswara, a former graduate student in information science and technology who is currently a software engineer at Apple, and C. Lee Giles, David Reese Professor at the College of Information Sciences and Technology, both at Penn State, worked with Wilson and Srinath on the project.
The team published its findings at the International Conference on Web Engineering.