Some notes taken at the Milner Award Lecture by Dr Serge Abiteboul for the Royal Society on 12th November, ‘From data and information to knowledge: the Web of tomorrow’. Dr Abiteboul was awarded the 2013 Milner Award, given annually for outstanding achievement in computer science by a European researcher.
Dr Abiteboul’s research work focuses mainly on data, information and knowledge management, particularly on the Web. Like NetIKX members, he is interested in the transition from data to knowledge. Among many prestigious projects, he has worked on Apple’s Siri interface and Active XML, a declarative framework that harnesses web services for data integration.
In a charming French accent, he explained to us that he was going to talk about networks – networks of machines (Internet), of content (Web) and people (social media).
Nowadays information is everywhere, worldwide. Everything is big and getting bigger – the size of the digital world is estimated to be doubling every 18 months. A web search engine now is a cluster of machines – maybe a million machines. In the past getting ten machines to work together was a big challenge! Engineering achievements have enabled hundreds of thousands of computers to work together.
Dr Abiteboul’s assumptions
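To get a feel for what "doubling every 18 months" means, here is a small back-of-the-envelope sketch. It is my own illustration, not something shown in the lecture, and the function name `growth_factor` is made up for the example. It simply assumes the doubling rate stays constant.

```python
def growth_factor(months, doubling_period_months=18):
    """Return how many times larger the data volume is after `months`,
    assuming it doubles every `doubling_period_months` (an assumption
    taken from the estimate quoted in the lecture)."""
    return 2 ** (months / doubling_period_months)


if __name__ == "__main__":
    for years in (1.5, 3, 10):
        factor = growth_factor(years * 12)
        print(f"After {years} years: about {factor:.1f} times the data")
```

On this assumption the digital world quadruples in three years and grows roughly a hundredfold in ten.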
1. The size will continue to grow
2. The information will continue to be managed by many systems (rather than a company like Facebook taking over all the world’s information).
3. These systems will be intelligent – in the sense that they produce/consume knowledge and not simply raw data.
The 4 + 1 V’s of Big Data…
Volume, Velocity, Variety, Veracity = four difficulties of big data. There is a huge mass of data, more than can be retrieved. And it is changing fast, particularly sets of data like the stock market. Furthermore, the information on the web is uncertain, full of imprecisions and contradictions. Search engines must contend with lies and opinions, not just facts.
Dr Abiteboul’s +1 is Value – the bottom line is, what value comes from all this data? How does a computer decide what is important to present?
Data analysis is a technical challenge as old as computer science. We know how to do it with a small amount of data; the next challenge is to do it with a huge amount. Complex algorithms will have to be designed. These will need to do low level statistical analysis, because finding the perfect statistics will take too long. Maths, informatics, engineering and hardware are all needed.
But of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die. (Genesis 2.17)
People often prefer being given one answer rather than a multitude of options to sort through. When we ask another person a question, they don’t reply by giving us twenty pages to read through, so why should we interact with machines (search engines) like that? (Note – should information professionals be very selective and choosy with the information we put forward to customers? Would they prefer a reading list of five books rather than twenty?)
Machines prefer formatted knowledge, logical statements. Machines can be programmed to find patterns – e.g. Woody Allen ‘is married to’ Soon-Yi Previn. But people write that two people are married in many different ways. How does a search engine cope with all the false statements and contradictions, e.g. ‘Elvis Presley died on 16 August 1977’ and ‘The King is alive’?
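For the curious, here is a minimal sketch of what "finding patterns" might look like in practice. It is my own toy illustration, not Dr Abiteboul's method: the three phrasings, the `NAME` pattern and the function `extract_marriages` are all assumptions made up for the example. Real systems have to cope with far more variety, and with the contradictions mentioned above.

```python
import re

# Assumption: a "name" is one or more capitalised words, hyphens allowed.
NAME = r"[A-Z][\w-]+(?: [A-Z][\w-]+)*"

# Hypothetical surface patterns that all express the same relation.
PATTERNS = [
    re.compile(rf"(?P<a>{NAME}) is married to (?P<b>{NAME})"),
    re.compile(rf"(?P<a>{NAME}) married (?P<b>{NAME})"),
    re.compile(rf"(?P<a>{NAME}), (?:husband|wife) of (?P<b>{NAME})"),
]


def extract_marriages(sentences):
    """Return a set of unordered name pairs found by any pattern."""
    facts = set()
    for sentence in sentences:
        for pattern in PATTERNS:
            match = pattern.search(sentence)
            if match:
                # Marriage is symmetric, so store the pair without order.
                facts.add(frozenset((match.group("a"), match.group("b"))))
    return facts


if __name__ == "__main__":
    sentences = [
        "Woody Allen is married to Soon-Yi Previn.",
        "Woody Allen married Soon-Yi Previn.",
        "Soon-Yi Previn, wife of Woody Allen, was there.",
    ]
    for pair in extract_marriages(sentences):
        print(" <-> ".join(sorted(pair)))
```

All three sentences collapse into a single logical statement, which is the kind of formatted knowledge a machine can work with.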
The real problem with the accuracy of Wikipedia is not incorrect amateurs but paid professionals with their own agenda, paid by companies to take a particular viewpoint.
The difficulty is knowing when to stop searching – when just enough right answers have been found. Precision, the fraction of results that are correct, must be balanced against the number of results retrieved. There is a trade-off between finding more knowledge and finding the correct knowledge. Machines will have to be programmed to separate the wheat from the chaff. Knowing the good sources, the trustworthy sources, is a huge advantage for this.
Next, Dr Abiteboul mentioned librarians! He praised the way that a librarian may suggest you read an article that transforms your research. Or you may hear by chance a song that totally obsesses you. Computers lack this serendipity – they’re square. Information professionals take heart: there is value in chance, in browsing shelves, in the ability of your brain to make suggestions computers wouldn’t.
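The trade-off can be made concrete with a small sketch. This is my own illustration with invented document names, not an example from the lecture. Alongside precision it uses recall, the standard information retrieval term for the fraction of all correct answers that were actually found.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved items.

    precision = correct results / results retrieved
    recall    = correct results / all correct answers that exist
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical ranked search results and the truly correct answers.
    ranked = ["d1", "d2", "d3", "d4", "d5", "d6"]
    relevant = {"d1", "d2", "d5"}

    # Retrieving more results finds more right answers, but also more chaff.
    for cutoff in (2, 4, 6):
        p, r = precision_recall(ranked[:cutoff], relevant)
        print(f"top {cutoff}: precision={p:.2f} recall={r:.2f}")
```

Stopping after two results gives only correct answers but misses one; reading all six finds everything, but half of what is read is wrong.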
We cannot archive all the data we produce – there’s a lack of storage resources. How do we choose what we keep? The British Library is tackling this question through its UK Web Archive project, which involves archiving 4.8 million UK websites and one billion web pages.
The BL Web Archiving page says: “We are developing a new crowd sourcing application that will use Twitter to support an automated selection process. We envisage that in the future, automated selection of this sort will compliment manual selection by subject experts, resulting in a more representative and well-rounded set of collections.” So perhaps the web of the future will need both expert people and star computing systems.
The decisions of machines
Decisions are increasingly made by machines. For instance, automated transport systems like the Docklands Light Railway, or auto trading on the stock market. How far do we go with this, asked Dr Abiteboul. Would a machine be allowed to decide that someone is a terrorist and kill them, and if so at what level of certainty? At 90% sure? At 95% sure?
Soon machines will acquire knowledge for us, remember knowledge for us, reason for us. We should get prepared by learning informatics, so that we understand them.
There were so many ideas flying about that I was unable to note them all down! Luckily the whole lecture is freely available to watch at www.youtube.com/watch?v=to9_Xc9f96E.
Blog post by Emily Heath.