Making true connections in a complex world – Graph database technology and Linked Open Data – 25th January 2018

Conrad Taylor writes:

The first NetIKX meeting of 2018, on 25 January, looked at new technologies and approaches to managing data and information, escaping the limitations of flat-file and relational databases. Dion Lindsay introduced the concepts behind ‘graph databases’, and David Clarke illustrated the benefits of the Linked Data approach with case studies, where the power of a graph database had been enhanced by linking to publicly available resources. The two presentations were followed by a lively discussion, which I also report here.

The New Graph Technology of Information – Dion Lindsay

Dion is an independent consultant well known to NetIKX members. He offered us a simple introduction to graph database technology, though he avers he is no expert in the subject. He’d been feeling unclear about the differences between managing data and information, and thought one way to explore that could be to study a ‘fashionable’ topic with a bit of depth to it. He finds graph database technology exciting, and thinks data- and information-managers should be excited about it too!

Flat-file and relational database models

In the last 40 years, the management of data with computers has been dominated by the Relational Database model devised in 1970 by Edgar F Codd, an IBM employee at their San José Research Center.

FLAT FILE DATABASES. Until then (and also for some time after), the model for storing data in a computer system was the ‘Flat File Database’ — analogous to a spreadsheet with many rows and columns. Dion presented a made-up example in which each record was a row, with the attributes or values being stored in fields, which were separated by a delimiter character (he used the | sign, which is #124 in most text encoding systems such as ASCII).

Example: Lname, Fname, Age, Salary|Smith, John, 35, £280|Doe, Jane, 28, £325|Lindsay, Dion, 58, £350…

In older flat-file systems, each individual record was typically input via a manually-prepared 80-column punched card, and the ingested data was ‘tabulated’ (made into a table); but there were no explicit relationships between the separate records. The data would then be stored on magnetic tape drives, and searching through those for a specific record was a slow process.

To search such a database with any degree of speed required loading the whole assembled table into RAM, then scanning sequentially for records that matched the terms of the query; but in those early days the limited size of RAM memory meant that doing anything clever with really large databases was not possible. They were, however, effective for sequential data processing applications, such as payroll, or issuing utility bills.
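For readers who think in code, the flat-file layout and the sequential scan can be sketched in a few lines of Python. This is an illustration only: the data follows Dion’s made-up example, while the function names, and the assumption that ‘|’ separates records and commas separate fields, are choices made here for the sketch.

```python
# Records are separated by '|' and fields by commas, as in the made-up example.
FLAT_FILE = (
    "Lname, Fname, Age, Salary|Smith, John, 35, £280|"
    "Doe, Jane, 28, £325|Lindsay, Dion, 58, £350"
)

def parse_flat_file(text):
    """Split the text into a header row and a list of record dicts."""
    rows = [
        [field.strip() for field in record.split(",")]
        for record in text.split("|")
        if record.strip()
    ]
    header, data_rows = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data_rows]

def sequential_scan(records, field, value):
    """Check every record in turn; there is no index to jump straight to a match."""
    return [record for record in records if record[field] == value]

records = parse_flat_file(FLAT_FILE)
print(sequential_scan(records, "Lname", "Doe"))
```

Note that every value comes back as text, and that finding one record means looking at all of them, which is the cost described above.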

The IBM 2311 (debut 1964) was an early hard drive unit with 7.25 MB storage. (Photo from Wikimedia Commons user ‘I, Deep Silence’.)

HARD DISKS and RELATIONAL DATABASES. Implementing Codd’s relational database management model (RDBM) was made possible by a fast-access technology for indexed file storage, the hard disk drive, which we might call ‘pseudo-RAM’. Hard drives had been around since the late fifties (the first was a component of the IBM RAMAC mainframe, storing 3.75 MB on nearly a ton of hardware), but it always takes time for the paradigm to shift…

By 1970, mainframe computers were routinely being equipped with hard disk packs of around 100 MB (example: IBM 3330). In 1979 Oracle beat IBM to market with the first Relational Database Management System (RDBMS). Oracle still has nearly half the global market share, with competition from IBM’s DB2, Microsoft SQL Server, and a variety of open source products such as MySQL and PostgreSQL.

As Dion pointed out, it was now possible to access, retrieve and process records from a huge enterprise-level database without having to read the whole thing into RAM or even know where it was stored on the disk; the RDBMS software and the look-up tables did the job of grabbing the relevant entities from all of the tables in the system.

TABLES, ATTRIBUTES, KEYS: In Codd’s relational model, which all these RDBMS applications follow, data is stored in multiple tables, each representing a list of instances of an ‘entity type’. For example, ‘customer’ is an entity type and ‘Jane Smith’ is an instance of that; ‘product’ is an entity type and ‘litre bottle of semi-skimmed milk’ is an instance of that. In a table of customer-entities, each row represents a different customer, and columns may associate that customer with attributes such as her address or loyalty-card number.

One of the attribute columns is used as the Primary Key to quickly access that row of the table; in a classroom, the child’s name could be used as a ‘natural’ primary key, but most often a unique and never re-used or altered artificial numerical ID code is generated (which gets around the problem of having two Jane Smiths).

Possible/permitted relationships can then be stated between all the different entity types; a list of ‘Transactions’ brings a ‘Customer’ into relationship with a particular ‘Product’, which has an ‘EAN’ code retrieved at the point of sale by scanning the barcode, and this retrieves the ‘Price’. The RDBMS can create temporary and supplementary tables to mediate these relationships efficiently.
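A minimal sketch of these ideas, using the SQLite engine bundled with Python, might look like the following. The table and column names, the EAN and the price are all invented for illustration; the transactions table is called ‘sale’ here because TRANSACTION is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (ean TEXT PRIMARY KEY, description TEXT, price REAL);
    CREATE TABLE sale (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        ean         TEXT    REFERENCES product(ean)
    );
    -- Two different customers share a name; the artificial ID keeps them apart.
    INSERT INTO customer VALUES (1, 'Jane Smith'), (2, 'Jane Smith');
    INSERT INTO product  VALUES ('0000000000017', 'Semi-skimmed milk, 1 litre', 0.95);
    INSERT INTO sale     VALUES (1, 2, '0000000000017');
""")

# The join follows the keys from each sale to its customer and its product.
rows = conn.execute("""
    SELECT c.customer_id, c.name, p.description, p.price
    FROM sale s
    JOIN customer c ON c.customer_id = s.customer_id
    JOIN product  p ON p.ean = s.ean
""").fetchall()
print(rows)
```

The relationship between a customer and a product exists only as matching key values in the ‘sale’ table; it is reconstructed by the join each time the query runs.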

Limitations of RDBMSs, benefits of graphs

However, there are some kinds of data which RDBMSs are not good at representing, said Dion. And many of these are the sorts of thing that currently interest those who want to make good use of the ‘big data’ in their organisations. Dion noted:

  • situations in which changes in one piece of data mean that another piece of data has changed as well;
  • representation of activities and flows.

Suppose, said Dion, we take the example of money transfers between companies. Company A transfers a sum of money to Company B on a particular date; Company B later transfers parts of that money to other companies on a variety of dates. And later, Company A may transfer monies to all these entities, and some of them may later transfer funds in the other direction… (or to somewhere in the British Virgin Islands?)

Graph databases represent these dynamics with circles for entities and lines between them, to represent connections between the entities. Sometimes the lines are drawn with arrows to indicate directionality, sometimes there is none. (This use of the word ‘graph’ is not to be confused with the diagrams we drew at school with x and y axes, e.g. to represent value changes over time.)

This money-transfer example goes some way towards describing why companies have been prepared to spend money on graph data technologies since about 2006 – it’s about money laundering and compliance with (or evasion of?) regulation. And it is easier to represent and explore such transfers and flows in graph technology.

Dion had recently watched a YouTube video in which an expert on such situations said that it is technically possible to represent such relationships within an RDBMS, but it is cumbersome.
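To make the money-transfer example concrete, here is a small sketch in plain Python rather than in any particular graph database product. Each transfer is a directed edge between two entities; the company names, amounts and dates are invented, and the function names are assumptions of this sketch.

```python
from collections import defaultdict, deque

# Each transfer is a directed edge: (payer, payee, amount, date). All values are made up.
TRANSFERS = [
    ("Company A", "Company B", 1000, "2018-01-02"),
    ("Company B", "Company C", 400, "2018-01-05"),
    ("Company B", "Company D", 300, "2018-01-09"),
    ("Company D", "Company A", 250, "2018-01-15"),
]

def build_graph(transfers):
    """Index the transfers by payer so outgoing edges can be followed directly."""
    graph = defaultdict(list)
    for payer, payee, amount, date in transfers:
        graph[payer].append((payee, amount, date))
    return graph

def reachable_from(graph, start):
    """Breadth-first traversal: every entity that money leaving `start` could reach."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for payee, _amount, _date in graph.get(node, []):
            if payee not in seen:
                seen.add(payee)
                queue.append(payee)
    return seen

graph = build_graph(TRANSFERS)
print(sorted(reachable_from(graph, "Company A")))
```

Company A turns up in its own result because one chain of transfers leads back to it. That kind of round trip is easy to find by following edges, but in a relational system it would need a join for every hop.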


Most NetIKX meetings incorporate one or two table-group sessions to help people make sense of what they have learned. Here, people are drawing graph data diagrams to Dion Lindsay’s suggestions.

Exercise

To get people used to thinking along graph database lines, Dion distributed a sheet of flip chart paper to each table (big pens were found) and asked each table group to start by drawing one circle for each person around the table, and labelling it.

The next part of the exercise was to create a circle for NetIKX, to which we all have a relationship (as a paid-up member or paying visitor), and also circles representing entities to which only some have a relation (such as employers or other organisations). People should then draw lines to link their own circle-entity to these others.

Dion’s previous examples had been about money-flows, and now he was asking us to draw lines to represent money-flows (i.e. if you paid to be here yourself, draw a line from you to NetIKX; but if your organisation paid, that line should go from your organisation-entity to NetIKX). I noted that aspect of the exercise engendered some confusion about the breadth of meaning that lines can carry in such a graph diagram. In fact they can represent any kind of relationship, so long as you have defined it that way, as Dion later clarified.

Dion had further possible tasks up his sleeve for us, but as time was short he drew out some interim conclusions. In graph databases, he summarised, you have connections instead of tables. These systems can manage many more complexities of relationships than either an RDBMS could cope with, or that we could cope with cognitively (and you can keep on adding complexity!). The graph database system can then show you what comes out of those complexities of relationship, which you had not been able to intuit for yourself, and this makes it a valuable discovery tool.

HOMEWORK: Dion suggested that as ‘homework’ we should take a look at an online tool and downloadable app which BP have produced to explore statistics of world energy use. The back end of this tool, Dion said, is based on a graph database.

https://www.bp.com/en/global/corporate/energy-economics/energy-charting-tool.html


Building Rich Search and Discovery: User Experiences with Linked Open Data – David Clarke

DAVE CLARKE is the co-founder, with Trish Yancey, of Synaptica LLC, which since 1995 has developed enterprise-level software for building and maintaining many different types of knowledge organisation systems. Dave announced that he would talk about Linked Data applications, with some very practical illustrations of what can be done with this approach.

The first thing to say is that Linked Data is based on an ‘RDF Graph’ — that is, a tightly-defined data structure, following norms set out in the Resource Description Framework (RDF) standards described by the World Wide Web Consortium (W3C).

In RDF, statements are made about resources, in expressions that take the form: subject – predicate – object. For example: ‘daffodil’ – ‘has the colour’ – ‘yellow’. (Also, ‘daffodil’ – ‘is a member of’ – ‘genus Narcissus’; and ‘Narcissus pseudonarcissus’ – ‘is a type of’ – ‘daffodil’.)

Such three-part statements are called ‘RDF triples’ and so the kind of database that manages them is often called an ‘RDF triple store’. The triples can also be represented graphically, in the manner that Dion had introduced us to, and can build up into a rich mass of entities and concepts linked up to each other.
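As a rough illustration of what a triple store does, the statements above can be held as three-part tuples and queried by pattern. This is a toy in plain Python, not a real RDF library; the `match` function and its wildcard convention are assumptions of the sketch.

```python
# Each statement is a (subject, predicate, object) tuple, taken from the examples above.
TRIPLES = [
    ("daffodil", "has the colour", "yellow"),
    ("daffodil", "is a member of", "genus Narcissus"),
    ("Narcissus pseudonarcissus", "is a type of", "daffodil"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

print(match(TRIPLES, subject="daffodil"))  # everything said about daffodils
print(match(TRIPLES, obj="daffodil"))      # everything that points at 'daffodil'
```

Because ‘daffodil’ appears as a subject in some statements and as an object in another, the statements join up into a graph without any table design being needed.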

Describing Linked Data and Linked Open Data

Dion had got us to do an exercise at our tables, but each table’s graph didn’t communicate with any other’s, like separate fortresses. This is the old database model, in which systems are designed not to share data. There are exceptions of course, such as when a pathology lab sends your blood test results to your GP, but those acts of sharing follow strict protocols.

Linked Data, and the resolve to be Open, are tearing down those walls. Each entity, as represented by the circles on our graphs, now gets its own ‘HTTP URI’, that is, its own unique Uniform Resource Identifier, expressed with the methods of the Web’s Hypertext Transfer Protocol. In effect, it gets a ‘Web address’ and becomes discoverable on the Internet, which in turn means that connections between entities are both possible and technically fairly easy and fast to implement.

And there are readily accessible collections of these URIs. Examples include:

We are all familiar with clickable hyperlinks on Web pages – those links are what weaves the ‘classic’ Web. However, they are simple pointers from one page to another; they are one-way, and they carry no meaning other than ‘take me there!’

In contrast, Linked Data links are semantic (expressive of meaning) and they express directionality too. As noted above, the links are known in RDF-speak as ‘predicates’, and they assert factual statements about why and how two entities are related. Furthermore, the links themselves have ‘thinginess’ – they are entities too, and those are also given their own URIs, and are thus also discoverable.
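The point that both the things and the links between them carry URIs can be shown by writing statements out in the line-based N-Triples style, where each URI sits in angle brackets and each statement ends with a full stop. The `http://example.org/` namespace and the local names below are placeholders, not real published identifiers, and this sketch does not escape special characters in literal values.

```python
EX = "http://example.org/"  # placeholder namespace, not a real dataset

def uri(local_name):
    """Wrap a local name in the placeholder namespace, N-Triples style."""
    return f"<{EX}{local_name}>"

def n_triple(subject, predicate, obj, literal=False):
    """Format one statement as a line; literals are quoted, URIs are bracketed."""
    obj_term = f'"{obj}"' if literal else uri(obj)
    return f"{uri(subject)} {uri(predicate)} {obj_term} ."

print(n_triple("daffodil", "hasColour", "yellow", literal=True))
print(n_triple("daffodil", "memberOf", "Narcissus"))
```

In the second line the subject, the predicate and the object are all URIs, so each of the three could in principle be looked up and described in its own right.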

People often confuse Open Data and Linked Data, but they are not the same thing. Data can be described as being Open if it is available to everyone via the Web, and has been published under a liberal open licence that allows people to re-use it. For example, if you are trying to write an article about wind power in the UK, there is text and there are tables about that on Wikipedia, and the publishing licence allows you to re-use those facts.

Stairway through the stars

Tim Berners-Lee, who invented the Web, has more recently become an advocate of the Semantic Web, writing about the idea in detail in 2005, and has argued for how it can be implemented through Linked Data. He proposes a ‘5-star’ deployment scheme for Open Data, with Linked Open Data being the starriest and best of all. Dave in his slide-set showed a graphic shaped like a five-step staircase, often used to explain this five-star system:

The ‘five-step staircase’ diagram often used to explain the hierarchy of Open Data types

  • One Star: this is when you publish your data to the Web under open license conditions, in whatever format (hopefully one like PDF or HTML for which there is free of charge reading software). It’s publishable with minimal effort, and the reader can look at it, print it, download and store it, and share it with others. Example: a data table that has been published as PDF.
  • Two stars: this is where the data is structured and published in a format that the reader can process with software that accesses and works with those structures. The example given was a Microsoft Excel spreadsheet. If you have Excel you can perform calculations on the data and export it to other structured formats. Other two-star examples could be distributing a presentation slide set as PowerPoint, or a document as Word (though when it comes to presentational forms, there are font and other dependencies that can trip us up).
  • Three stars: this is where the structure of a data document has been preserved, but in a non-proprietary format. The example given was of an Excel spreadsheet exported as a CSV file (comma-separated values format, a text file where certain characters are given the role of indicating field boundaries, as in Dion’s example above). [Perhaps the edges of this category have been abraded by software suites such as OpenOffice and LibreOffice, which themselves use non-proprietary formats, but can open Microsoft-format files.]
  • Four stars: this is perhaps the most difficult step to explain, and is when you put the data online in a graph database format, using open standards such as Resource Description Framework (RDF), as described above. For the publisher, this is no longer such a simple process and requires thinking about structures, and new conversion and authoring processes. The advantage to the users is that the links between the entities can now be explored as a kind of extended web of facts, with semantic relationships constructed between them.
  • Five stars: this is when Linked Data graph databases, structured to RDF standards, ‘open up’ beyond the enterprise, and establish semantic links to other such open databases, of which there are increasingly many. This is Linked Open Data! (Note that a Linked Data collection held by an enterprise could be part-open and part-closed. There are often good commercial and security reasons for not going fully open.)

This hierarchy is explained in greater detail at http://5stardata.info/en/
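The jump from three stars to four is the one that asks most of the publisher, so a small sketch of the conversion may help. It takes a CSV table and turns each cell into a three-part statement whose subject and predicate are URIs. The sample data, the `http://example.org/` namespace and the URI patterns are all invented for this sketch; a real publisher would choose identifiers and vocabularies with far more care.

```python
import csv
import io

CSV_TEXT = "id,name,colour\n1,daffodil,yellow\n2,bluebell,blue\n"
BASE = "http://example.org/"  # placeholder namespace

def csv_to_triples(text, id_column="id"):
    """Turn each non-ID cell into a (subject URI, predicate URI, value) statement."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"{BASE}plant/{row[id_column]}"  # one URI per row
        for column, value in row.items():
            if column != id_column:
                triples.append((subject, f"{BASE}property/{column}", value))
    return triples

for triple in csv_to_triples(CSV_TEXT):
    print(triple)
```

The values are still plain text at this stage. Reaching five stars would mean replacing a value such as ‘yellow’ with a link to an identifier in someone else’s open dataset.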

Dave suggested that if we want to understand how many organisations currently participate in the ‘Linked Open Data Cloud’, and how they are linked, we might visit http://lod-cloud.net, where there is an interactive and zoomable SVG graphic version showing several hundred linked databases. The circles that represent them are grouped and coloured to indicate their themes and, if you hover your cursor over one circle, you will see an information box, and be able to identify the incoming and outgoing links as they flash into view. (Try it!)

The largest and most densely interlinked ‘galaxy’ in the LOD Cloud is in the Life Sciences; other substantial ones are in publishing and librarianship, linguistics, and government. One of the most central and most widely linked is DBpedia, which extracts structured data created in the process of authoring and maintaining Wikipedia articles (e.g. the structured data in the ‘infoboxes’). DBpedia is big: it stores nine and a half billion RDF triples!

Screen shot taken while zooming into the heart of the Linked Open Data Cloud (interactive version). I have positioned the cursor over ‘datos.bne.es’ for this demonstration. This brings up an information box, and lines which show links to other LOD sites: red links are ‘incoming’ and green links are ‘outgoing’.

The first case study Dave presented was an experiment conducted by his company Synaptica to enhance discovery of people in the news, and stories about them. A ready-made LOD resource they were able to use was DBpedia’s named graph of people. (Note: the Named Graphs data model is a variant on the RDF data model: it allows RDF triples to talk about RDF graphs. This creates a level of metadata that assists searches within a graph database using the SPARQL query language.)

Many search and retrieval solutions focus on indexing a collection of data and documents within an enterprise – ‘in a box’ if you like – and providing tools to rummage through that index and deliver documents that may meet the user’s needs. But what if we could also search outside the box, connecting the information inside the enterprise with sources of external knowledge?

The second goal of this Synaptica project was about what it could deliver for the user: they wanted search to answer questions, not just return a bunch of relevant electronic documents. Now, if you are setting out to answer a question, the search system has to be able to understand the question…

For the experiment, which preceded the 2016 US presidential elections, they used a reference database of about a million news articles, a subset of a much larger database made available to researchers by Signal Media (https://signalmedia.co). Associated Press loaned Synaptica their taxonomy collection, which contains more than 200,000 concepts covering names, geospatial entities, news topics etc. – a typical and rather good taxonomy scheme.

The Linked Data part was this: Synaptica linked entities in the Associated Press taxonomy out to DBpedia. If a person is famous, DBpedia will have hundreds of data points about that person. Synaptica could then build on that connection to external data.

SHOWING HOW IT WORKS. Dave went online to show a search system built with the news article database, the AP taxonomy, and a link out to the LOD cloud, specifically DBpedia’s ‘persons’ named graph. In the search box he typed ‘Obama meets Russian President’. The results displayed noted the possibility that Barack or Michelle might match ‘Obama’, but unhesitatingly identified the Russian President as ‘Vladimir Putin’ – not from a fact in the AP resource, but by checking with DBpedia.

As a second demo, he launched a query for ‘US tennis players’, then added some selection criteria (‘born in Michigan’). That is a set which includes news stories about Serena Williams, even though the news articles about Serena don’t mention Michigan or her birth-place. Again, the link was made from the LOD external resource. And Dave then narrowed the field by adding the criterion ‘after 1980’, and Serena stood alone.

It may be, noted Dave, that a knowledgeable person searching a knowledgebase, be it on the Web or not, will bring to the task much personal knowledge that they have and that others don’t. What’s exciting here is using a machine connected to the world’s published knowledge to do the same kind of connecting and filtering as a knowledgeable person can do – and across a broad range of fields of knowledge.

NATURAL LANGUAGE UNDERSTANDING. How does this actually work behind the scenes? Dave again focused on the search expressed in text as ‘US tennis players born in Michigan after 1980’. The first stage is to use Natural Language Understanding (NLU), a relative of Natural Language Processing, and long considered as one of the harder problem areas in Artificial Intelligence.

The Synaptica project uses NLU methods to parse extended phrases like this, and break them down into parts of speech and concept clusters (‘tennis players’, ‘after 1980’). Some of the semantics are conceptually inferred: in ‘US tennis players’, ‘US’ is inferred contextually to indicate nationality.

On the basis of these machine understandings, the system can then launch specific sub-queries into the graph database, and the LOD databases out there, before combining them to derive a result. For example, the ontology of DBpedia has specific parameters for birth date, birthplace, death date, place of death… These enhanced definitions can bring back the lists of qualifying entities and, via the AP taxonomy, find them in the news content database.
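The shape of that process, with external facts used to select people and an internal index used to find the stories, can be mimicked with two small dictionaries. This is not Synaptica’s implementation. ‘Player X’, ‘Player Y’, the article identifiers and the field names are invented; only Serena Williams’s birthplace and birth year are real facts.

```python
# Facts of the kind an external LOD source might supply (illustrative only).
PEOPLE = {
    "Serena Williams": {"occupation": "tennis player", "nationality": "US",
                        "birth_place": "Michigan", "birth_year": 1981},
    "Player X":        {"occupation": "tennis player", "nationality": "US",
                        "birth_place": "Michigan", "birth_year": 1975},
    "Player Y":        {"occupation": "tennis player", "nationality": "US",
                        "birth_place": "Florida", "birth_year": 1985},
}

# Internal index: which articles are tagged with which person (made up).
ARTICLES_BY_PERSON = {
    "Serena Williams": ["article-101", "article-205"],
    "Player X": ["article-042"],
}

def find_people(people, **criteria):
    """Apply one test per criterion; a callable criterion is used as a predicate."""
    def satisfies(facts):
        return all(
            test(facts[key]) if callable(test) else facts[key] == test
            for key, test in criteria.items()
        )
    return {name for name, facts in people.items() if satisfies(facts)}

matches = find_people(PEOPLE, occupation="tennis player", nationality="US",
                      birth_place="Michigan", birth_year=lambda year: year > 1980)
articles = [a for name in sorted(matches) for a in ARTICLES_BY_PERSON.get(name, [])]
print(matches, articles)
```

None of the articles needs to mention Michigan or a birth date: the selection happens against the external facts, and only the resulting names are looked up in the news index.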

Use case: understanding symbolism inside art images

Dave’s second case study concerned helping art history students make searches inside images with the aid of a Linked Open Data resource, the Getty Art and Architecture Thesaurus.

A seminal work in Art History is Erwin Panofsky’s Studies in Iconology (1939), and Dave had re-read it in preparation for building this application, which is built on Panofskyan methods. Panofsky describes three levels of analysis of iconographic art images:

  • Natural analysis gives a description of the visual evidence. It operates at the level of methods of representation, and its product is an annotation of the image (as a whole, and its parts).
  • Conventional analysis (Dave prefers the term ‘conceptual analysis’) interprets the conventional meanings of visual components: the symbolism, allusions and ideas that lie behind them. This can result in semantic indexing of the image and its parts.
  • Intrinsic analysis explores the wider cultural and historical context. This can result in the production of ‘knowledge graphs’

 

Detail from the left panel of Hieronymus Bosch’s painting ‘The Garden of Earthly Delights’, which is riddled with symbolic iconography.

THE ‘LINKED CANVAS’ APPLICATION. The educational application which Synaptica built is called Linked Canvas (see http://www.linkedcanvas.org/). Their first step was to ingest the art images at high resolution. The second step was to ingest linked data ontologies such as DBpedia, Europeana, Wikidata, Getty AAT, Library of Congress Subject Headings and so on.

The software system then allows users to delineate Points of Interest (POIs), and annotate them at the natural level; the next step is the semantic indexing, which draws on the knowledge of experts and controlled vocabularies. Finally users get to benefit from tools for search and exploration of the annotated images.

With time running tight, Dave skipped straight to some live demos of examples, starting with the fiendishly complex 15th century triptych painting The Garden of Earthly Delights. At Panofsky’s level of ‘natural analysis’, we can decompose the triptych space into the left, centre and right panels. Within each panel, we can identify ‘scenes’, and analyse further into details, in a hierarchical spatial array, almost the equivalent of a detailed table of contents for a book. For example, near the bottom of the left panel there is a scene in which God introduces Eve to Adam. And within that we can identify other spatial frames and describe what they look like (for example, God’s right-hand gesture of blessing).
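The ‘table of contents’ comparison suggests a simple nested data structure. The sketch below models the panels, scene and detail mentioned above as a tree and prints it as an indented outline. The `Region` class and its fields are assumptions of this sketch and say nothing about how Linked Canvas stores its annotations.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    label: str
    note: str = ""  # free-text 'natural' description
    children: list = field(default_factory=list)

triptych = Region("The Garden of Earthly Delights", children=[
    Region("Left panel", children=[
        Region("God introduces Eve to Adam", children=[
            Region("God's right hand", note="gesture of blessing"),
        ]),
    ]),
    Region("Centre panel"),
    Region("Right panel"),
])

def outline(region, depth=0):
    """Yield one indented line per region, visiting the tree depth first."""
    yield "  " * depth + region.label
    for child in region.children:
        yield from outline(child, depth + 1)

print("\n".join(outline(triptych)))
```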

To explain semantic indexing, Dave selected an image painted 40 years after the Bosch — Hans Holbein the Younger’s The Ambassadors, which is in the National Gallery in London. This too is full of symbolism, much of it carried by the various objects which litter the scene, such as a lute with a broken string, a hymnal in a translation by Martin Luther, a globe, etc. To this day, the meanings carried in the painting are hotly debated amongst scholars.

If you zoom in and browse around this image in Linked Canvas, as you traverse the various artefacts that have been identified, the word-cloud on the left of the display changes contextually, and what this reveals is how the symbolic and contextual meanings of those objects and visual details have been identified in the semantic annotations.

An odd feature of this painting is the prominent inclusion in the lower foreground of an anamorphically rendered (highly distorted) skull. (It has been suggested that the painting was designed to be hung on the wall of a staircase, so that someone climbing the stairs would see the skull first of all.) The skull is a symbolic device, a reminder of death or memento mori, a common visual trope of the time. That concept of memento mori is an element within the Getty AAT thesaurus, and the concept has its own URI, which makes it connectable to the outside world.

Dave then turned to Titian’s allegorical painting Bacchus and Ariadne, also from the same period and also from the National Gallery collection, and based on a story from Ovid’s Metamorphoses. In this story, Ariadne, who had helped Theseus find his way in and out of the labyrinth where he slew the Minotaur, and who had become his lover, has been abandoned by Theseus on the island of Naxos (and in the background if you look carefully, you can see his ship sneakily making off). And then along comes the God of Wine, Bacchus, at the head of a procession of revellers and, falling in love with Ariadne at first glance, he leaps from the chariot to rescue and defend her.

Following the semantic links (via the LOD database on Iconography) can take us to other images about the tale of Ariadne on Naxos, such as a fresco from Pompeii, which shows Theseus ascending the gang-plank of his ship while Ariadne sleeps. As Dave remarked, we generate knowledge when we connect different data sets.

Another layer built on top of the Linked Canvas application was the ability to create ‘guided tours’ that walk the viewer around an image, with audio commentary. The example Dave played for us was a commentary on the art within a classical Greek drinking-bowl, explaining the conventions of the symposium (Greek drinking party). Indeed, an image can host multiple such audio commentaries, letting a visitor experience multiple interpretations.

In building this image resource, Synaptica made use of a relatively recent standard called the International Image Interoperability Framework (IIIF). This is a set of standardised application programming interfaces (APIs) for websites that aim to do clever things with images and collections of images. For example, it can be used to load images at appropriate resolutions and croppings, which is useful if you want to start with a fast-loading overview image and then zoom in. The IIIF Search API is used for searching the annotation content of images.
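To give a feel for the overview-then-zoom pattern, the IIIF Image API lets a client ask for a region and size of an image through the URL itself, in the order identifier / region / size / rotation / quality.format. The helper below simply assembles such a URL. The server address and the image identifier are hypothetical, and the size keyword ‘max’ belongs to recent versions of the API, so an older server may expect a different value.

```python
def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

BASE = "https://example.org/iiif"  # hypothetical image server

# A fast-loading overview: the whole image, scaled to 800 pixels wide.
overview = iiif_image_url(BASE, "ambassadors", size="800,")

# A zoomed detail: a rectangle given as x,y,width,height in pixels.
detail = iiif_image_url(BASE, "ambassadors", region="1200,2400,600,400")

print(overview)
print(detail)
```

Because cropping and scaling are expressed in the address, a viewer can fetch only the pixels it needs as the user zooms and pans.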

Searching within Linked Canvas is what Dave described as ‘Level Three Panofsky’. You might search on an abstract concept such as ‘love’, and be presented with a range of details within a range of images, plus links to scholarly articles linked to those.

Post-Truth Forum

As a final example, Dave showed us http://www.posttruthforum.org, which is an ontology of concepts around the ideas of ‘fake news’ and the ‘post-truth’ phenomenon, with thematically organised links out to resources on the Web, in books and in journals. Built by Dave using Synaptica Graphite software, it is Dave’s private project born out of a concern about what information professionals can do as a community to stem the appalling degradation of the quality of information in the news media and social media.

For NetIKX members (and for readers of this post), going to Dave’s Post Truth Forum site is also an opportunity to experience a public Linked Open Data application. People may also want to explore Dave’s thoughts as set out on his blog, www.davidclarke.blog.

Taxonomies vs Graphs

In closing, Dave wanted to show a few examples that might feed our traditional post-refreshment round-table discussions. How can we characterise the difference between a taxonomy and a data graph (or ontology)? His first image was an organisation chart, literally a regimented and hierarchical taxonomy (the US Department of Defense and armed forces).

His second image was the ‘tree of life’ diagram, the phylogenetic tree that illustrates how life forms are related to each other, and to common ancestor species. This is also a taxonomy, but with a twist. Here, every intermediate node in the tree not only inherits characteristics from higher up, but also adds new ones. So, mammals have shared characteristics (including suckling young), placental mammals add a few more, and canids such as wolves, jackals and dogs have other extra shared characteristics. (This can get confusing if you rely too much on appearances: hyenas look dog-like, but are actually more closely related to the big cats.)
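The ‘inherits and adds’ idea can be sketched with two small lookup tables: one saying which group sits under which, and one listing what each group adds. The wording of the characteristics is illustrative, and the entry for canids is deliberately left as a placeholder.

```python
# child -> parent links, plus the characteristics each node adds (illustrative).
PARENT = {"placental mammals": "mammals", "canids": "placental mammals"}
ADDS = {
    "mammals": ["suckle their young"],
    "placental mammals": ["young develop via a placenta"],
    "canids": ["(extra shared canid characteristics)"],
}

def characteristics(node):
    """Collect characteristics from the root of the tree down to `node`."""
    inherited = characteristics(PARENT[node]) if node in PARENT else []
    return inherited + ADDS.get(node, [])

print(characteristics("canids"))
```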

So the Tree of Life captures systematic differentiation, which a taxonomy typically cannot. However, said Dave, an ontology can. In making an ontology we specify all the classes we need, and can specify the property sets as we go. And, referring back to Dion’s presentation, Dave remarked that while ontologies do not work easily in a relational database structure, they work really well in a graph database. In a graph database you can handle processes as well as things and specify the characteristics of both processes and things.

Dave’s third and final image was of the latest version of the London Underground route diagram. This is a graph, specifically a network diagram, that is characterised not by hierarchy, but by connections. Could this be described in a taxonomy? You’d have to get rid of the Circle line, because taxonomies can’t end up where they started from. With a graph, as with the Underground, you can enter from any direction, and there are all sorts of ways to make connections.
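The Circle line remark can be turned into a small test: a strict hierarchy never contains a loop, whereas a network can. The function below uses the union-find technique to check a list of connections for a loop. The node names are placeholders, not real stations, and the function name is an assumption of this sketch.

```python
def has_cycle(edges):
    """Union-find over undirected edges: an edge that joins two nodes
    which are already connected closes a loop."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False

hierarchy = [("root", "branch 1"), ("root", "branch 2"), ("branch 1", "leaf")]
loop_line = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]  # placeholder stations

print(has_cycle(hierarchy), has_cycle(loop_line))
```

A structure for which this returns True cannot be drawn as a taxonomy without cutting one of its connections, but it is an ordinary graph.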

We shouldn’t think of ditching taxonomies; they are excellent for some information management jobs. Ontologies are superior in some applications, but not all. The ideal is to get them working together. It would be a good thought-experiment for the table groups to think about what, in our lives and jobs, are better suited to taxonomic approaches and what would be better served by graphs and ontologies. And, we should think about the vast amounts of data out there in the public domain, and whether our enterprises might benefit from harnessing those resources.


Discussion

Following NetIKX tradition, after a break for refreshments, people again settled down into small table groups. We asked participants to discuss what they had heard and identify either issues they thought worth raising, or things that they would like to know more about.

I was chairing the session, and I pointed out that even if we didn’t have time in subsequent discussion to feed everyone’s curiosity, I would do my best to research supplementary information to add to this account which you are reading.

I ran the audio recorder during the plenary discussion, so even though I was not party to what the table groups had discussed internally, I can report with some accuracy what came out of the session. Because the contributions jumped about a bit from topic to topic, I have resequenced them to make them easier for the reader to follow.

AI vs Linked Data and ontologies?

Steve Dale wondered if these efforts to compile graph databases and ontologies were worth it, as he believed Artificial Intelligence is reaching the point where a computer can be thrown all sorts of data – structured and unstructured – and left to figure it out for itself through machine learning algorithms. Later, Stuart Ward expressed a similar opinion. Speaking as a business person, not a software wizard, he wondered whether there was anything that he needed to design.

Conrad, in fielding this question, mentioned that on the table he’d been on (Dave Clarke also), they had looked some more into the use in Dave’s examples of Natural Language Understanding; that is a kind of AI component. But they had also discussed the example of the Hieronymus Bosch painting. Dave himself undertook the background research for this and had to swot up by reading a score of scholarly books. In Conrad’s opinion, we would have to wait another millennium before we’d have an AI able to trace the symbolism in Bosch’s visual world. Someone else wondered how one strikes the right balance between the contributions of AI and human effort.

Later, Dave Clarke returned to the question; in his opinion, AI is heavily hyped – though if you want investment, it’s a good buzz-word to throw about! So-called Artificial Intelligence works very well in certain domains, such as pattern recognition, and even with images (example: face recognition in many cameras). But AI is appalling at semantics. At Synaptica, they believe that if you want to create applications using machine intelligence, you must structure your data. Metadata and ontologies are the enablers for smart applications.

Dion responded to Stuart’s question by saying that it would be logical at least to define what your entities are – or at least, to define what counts as an entity, so that software can identify entities and distinguish them from relationships. Conrad said that the ‘predicates’ (relationships) also need defining, and in the Linked Data model this can be assisted if you link out to publicly-available schemas.

Dave added that, these days, in the Linked Data world, it has become pretty easy to adapt your database structures as you go along. Compared to the pain and disruption of trying to modify a relational database, it is easy to add new types of data and new types of query to a Linked Data model, making the initial design process less traumatic and protracted.
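(To illustrate why this kind of extension is comparatively painless, here is a minimal sketch in plain Python, not a real RDF database. Facts are stored as subject, predicate, object triples, so a new kind of fact is just another triple and there is no table structure to alter. All the `ex:` identifiers and the `match` helper are hypothetical names invented for the illustration.)

```python
# Each fact is a (subject, predicate, object) triple; there is no fixed table schema.
triples = {
    ("ex:painting1", "ex:title", "Example Painting"),
    ("ex:painting1", "ex:creator", "ex:artist1"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Later on, a new type of data is wanted. Nothing has to be migrated:
# the new predicate simply appears in new triples.
triples.add(("ex:painting1", "ex:depicts", "ex:motif1"))

# A new type of query is just a new pattern.
print(match(triples, s="ex:painting1", p="ex:depicts"))
```

In a relational design, the equivalent change would usually mean adding a column or a join table before any such fact could be recorded.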

Graph databases vs Linked Open Data?

Conrad asked Dave to clarify a remark he had made at table level about the capabilities of a graph database product like Neo4j, compared with Linked Open Data implementations.

Dave explained that Neo4j is indeed a graph database system, but it is not an RDF database or a Linked Data database. When Synaptica started to move from their prior focus on relational databases towards graph databases, Dave became excited about Neo4j (at first). They got it in, and found it was a wonderfully easy system to develop with. However, because its method of data modelling is not based on RDF, Neo4j was not going to be a solution for working with Linked Data; and so fervently did Dave believe that the future is about sharing knowledge, he pulled the plug on their Neo4j development.

He added that he has no particular axe to grind about which RDF database they should use, but it has to be RDF-conforming. There are both proprietary systems (from Oracle, IBM DB2, OntoText GraphDB, MarkLogic) and open-source systems (3store, ARC2, Apache Jena, RDFLib). He has found that the open-source systems can get you so far, but for large-scale implementations one generally has to dip into the coffers and buy a licence for something heavyweight.

Even if your organisation has no intention to publish data, designing and building as Linked Data lets you support smart data and machine reasoning, and benefit from data imported from external Linked Open Data resources.

Conrad asked Dion to say more about his experiences with graph databases. He said that he had approached Tableau, who had provided him with sample software and sample datasets. He hadn’t yet had a chance to engage with them, but would be very happy to report back on what he learns.

Privacy and data protection

Clare Parry raised issues of privacy and data protection. You may have information in your own dataset that does not give much information about people, and you may be compliant with all the data protection legislation. However, if you pull in data from other datasets, and combine them, you could end up inferring quite a lot more information about an individual.

(I suppose the answer here is to do with controlling which kinds of datasets are allowed to be open. We are on all manner of databases, sometimes without suspecting it. A motor car’s registration details are held by DVLA, and Transport for London; the police and TfL use ANPR technology to tie vehicles to locations; our banks have details of our debit card transactions and, if we use those cards to pay for bus journeys, that also geolocates us. These are examples of datasets that by ‘triangulation’ could identify more about us than we would like.)
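(A rough sketch of how such ‘triangulation’ works: neither of the two datasets below says much on its own, but joining them on a shared key produces a record of one person’s movements. Every record here is invented for the illustration, and the function name is my own.)

```python
# Dataset A: number plate -> registered keeper (hypothetical records).
keepers = {"AB12 CDE": "Keeper 1", "XY34 ZZZ": "Keeper 2"}

# Dataset B: camera sightings of number plates (hypothetical records).
sightings = [
    ("AB12 CDE", "Camera North", "08:05"),
    ("AB12 CDE", "Camera East", "18:40"),
    ("XY34 ZZZ", "Camera North", "09:15"),
]

def movements_by_person(keepers, sightings):
    """Join the two datasets on their shared key, the number plate."""
    result = {}
    for plate, place, time in sightings:
        person = keepers.get(plate)
        if person is not None:
            result.setdefault(person, []).append((time, place))
    return result

for person, trail in movements_by_person(keepers, sightings).items():
    print(person, trail)
```

The join itself is trivial, which is rather the point: the privacy risk comes from the existence of a common key, not from any clever processing.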

URI, URL, URN

Graham Robertson reported that on his table they discussed what the difference is between URLs and URIs…

(If I may attempt an explanation: the wider term is URI, Uniform Resource Identifier. It is ‘uniform’ because everybody is supposed to use it the same way, and it is supposed uniquely and unambiguously to identify anything which might be called a ‘resource’. The Uniform Resource Locator (URL) is the most common sub-type of URI, which says where a resource can be found on the Web.

But there can be other kinds of resource identifiers: the URN (Uniform Resource Name) identifies a resource that can be referenced within a controlled namespace. Wikipedia gives as an example ISBN 0-486-27557-4, which refers to a specific edition of Shakespeare’s Romeo and Juliet. In the MeSH schema of medical subject headings, the code D004617 refers to ‘embolism’.)
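(For those who like to see such things in code, here is a small sketch using Python’s standard `urllib.parse` module to look at the scheme part of an identifier. The `classify` function and its three-way split are my own simplification: it treats anything starting `urn:` as a URN and a handful of familiar web schemes as URLs, which is a rule of thumb rather than a formal definition.)

```python
from urllib.parse import urlparse

def classify(identifier):
    """Rough rule of thumb: sort an identifier by its scheme."""
    scheme = urlparse(identifier).scheme
    if scheme == "urn":
        return "URN"          # names a resource within a namespace
    if scheme in {"http", "https", "ftp"}:
        return "URL"          # says where the resource can be fetched
    if scheme:
        return "other URI"    # some other registered or private scheme
    return "not a URI"        # no scheme at all, e.g. a bare code

for example in [
    "urn:isbn:0-486-27557-4",
    "https://neo4j.com/graph-databases-book/",
    "mailto:someone@example.org",
    "D004617",
]:
    print(example, "->", classify(example))
```

Note that a bare code such as D004617 is not a URI by itself; it only becomes one when a scheme and namespace are put in front of it.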

Trustworthiness

Some people had discussed the issue of the trustworthiness of external data sources to which one might link – Wikipedia (and WikiData and DBpedia) among them, and Conrad later asked Mandy to say more about this. She wondered about the wisdom of relying on data which you can’t verify, and which may have been crowdsourced. But Dave had pointed out that you might have alternative authorities that you can point to. Conrad thought that for some serious applications one would want to consult experts, which is how the Getty AAT has been built up. Knowing provenance, added David Penfold, is very important.

The librarians ask: ontologies vs taxonomies?

Rob Rosset’s table was awash with librarians, who tend to have an understanding about what is a taxonomy and what an ontology. How did Dave Clarke see this, he asked?

Dave referred back to his closing three slides. The organisational chart he had shown is a strict hierarchy, and that is how taxonomies are structured. The diagram of the Tree of Life is an interesting hybrid, because it is both taxonomic and ontological in nature. There are things that mammals have in common, related characteristics, which are different from what other groupings such as reptiles would have.

But we shouldn’t think about abandoning taxonomy in favour of ontology. There will be times where you want to explore things top-down (taxonomically), and other cases where you might want to explore things from different directions.

What is nice about Linked Data is that it is built on standards that support these things. In the W3C world, there is the SKOS standard, Simple Knowledge Organization System, very light and simple, and there to help you build a taxonomy. And then there is OWL, the Web Ontology Language, which will help you ascend to another level of specificity. And in fact, SKOS itself is an ontology.
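(As a minimal sketch of what a SKOS-style taxonomy looks like as data: each concept points to its parent with the `skos:broader` property, whose full URI is given below. The `ex:` concepts are hypothetical, and the helper assumes each concept has a single broader concept, although SKOS itself allows more than one.)

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"

# A toy taxonomy expressed as triples (hypothetical concepts).
skos_triples = [
    ("ex:cats", SKOS_BROADER, "ex:mammals"),
    ("ex:mammals", SKOS_BROADER, "ex:animals"),
    ("ex:reptiles", SKOS_BROADER, "ex:animals"),
]

def broader_chain(concept, triples):
    """Follow skos:broader links upwards from `concept` to the top of the hierarchy."""
    parent_of = {s: o for s, p, o in triples if p == SKOS_BROADER}
    chain = []
    while concept in parent_of:
        concept = parent_of[concept]
        if concept in chain:
            break  # guard against an accidental loop in the data
        chain.append(concept)
    return chain

print(broader_chain("ex:cats", skos_triples))
```

Because the taxonomy is just a set of triples with one particular predicate, other predicates describing non-hierarchical relationships can sit alongside it in the same store.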

Closing thoughts and resources

This afternoon was a useful and lively introduction to the overlapping concepts of Graph Databases and Linked Data, and I hope that the above account helps refresh the memories of those who attended, and engage the minds of those who didn’t. Please note that in writing this I have ‘smuggled in’ additionally-researched explanations and examples, to help clarify matters.

Later in the year, NetIKX is planning a meeting all about Ontologies, which will be a way to look at these information and knowledge management approaches from a different direction. Readers may also like to read my illustrated account of a lecture on Ontologies and the Semantic Web, which was given by Professor Ian Horrocks to a British Computer Society audience in 2005. That is still available as a PDF from http://www.conradiator.com/resources/pdf/Horrocks_needham2005.pdf

Ontologies, taxonomies and knowledge organisation systems are meat and drink to the UK Chapter of the International Society for Knowledge Organization (ISKO UK), and in September 2010 ISKO UK held a full day conference on Linked Data: the future of knowledge organization on the Web. There were nine speakers and a closing panel session, and the audio recordings are all available on the ISKO UK Web site, at http://www.iskouk.org/content/linked-data-future-knowledge-organization-web

Recently, the Neo4j team produced a book by Ian Robinson, Jim Webber and Emil Eifrem called ‘Graph Databases’, and it is available for free (PDF, Kindle etc) from https://neo4j.com/graph-databases-book/ . Or you can get it published in dead-tree form from O’Reilly Books: see https://www.amazon.co.uk/Graph-Databases-Ian-Robinson/dp/1449356265

Hillsborough : Information Work and Helping People – July 21 2015

Jan Parry, CILIP’s President, gave a talk to NetIKX at the British Dental Association on the Hillsborough Disaster of 1989 and her role with the Hillsborough Independent Panel, which was set up in 2009 to oversee the release of documents arising from the tragedy. In it, 96 people lost their lives at an FA Cup semi-final between Liverpool and Nottingham Forest held at Hillsborough, the home ground of Sheffield Wednesday FC. It was a very thought-provoking talk, which was received in near silence. Jan began by outlining the earlier signals of potential disaster in the 1980s, when serious ‘crushing’ incidents took place in the “pens” – standing areas in front of the West Stand accessed by gates on Leppings Lane. She then talked about the day in question: Liverpool fans arrived late after being delayed by roadworks on the M62, traffic flowed along Leppings Lane until 38 minutes before kick-off, and there were no managed queues at the turnstiles. The “pens” were full 10 minutes before the match started. There was a lack of signs and stewarding to direct fans to other standing areas. At 3:00pm crowds were still outside the turnstiles and the police Chief Superintendent in charge – who had been appointed to oversee policing on the day only a little before the semi-final itself – gave an order to open the gates. There was a rush of fans towards the “pens”; people at the front were pushed forward, and crushing and fatalities took place quickly. At 3:06pm the game was stopped. Then there was a Police Control Box Meeting at 3:15pm. The gymnasium became a temporary mortuary and witness statements started to be taken.

Official investigations began – Lord Justice Taylor (1990); West Midlands Police investigated South Yorks Police (1990); The Inquest (1990); Lord Justice Stuart-Smith Scrutiny (1998). On the 20th anniversary memorial Andy Burnham (then a government minister) called for the early release of all documents. The Hillsborough Independent Panel was set up. Jan’s role was to undertake research and families’ disclosure: to oversee document discovery, manage information and consult the families. It began with finding family information – there were three established groups of families, and all the other families as well.

There were lots of issues. Significantly, there had been a big impact on the mental health of the families involved in the tragedy. Also, it took real persuasion to get hold of the documents. The documents then had to be scanned, digitised, catalogued and redacted on a secure system. This called for researchers with medical knowledge too. What came out of this great exercise?

In essence, the last valid safety certificate for the football stadium was issued in 1979; the code word for a “major incident” was never used; there was poor communication between ALL agencies; there was minimal medical treatment at the ground; witness statements had been changed; information on “The Sun’s” notorious leading article was obtained. With so much achieved, a disclosure day was put in the calendar – 12th September 2012. Again, the families were put first and informed that 41 victims could have lived.

On Disclosure Day itself PM David Cameron publicly apologised for the tragedy. The report was put on the website, which is a permanent archive for the documents: http://hillsborough.independent.gov.uk . Disclosure had quite an impact – Sir Norman Bettison (Chief Constable of West Yorkshire, who was a South Yorkshire Police officer at the time of the tragedy) resigned; the original inquests were quashed. Now there are new inquests and inquiries. Lord Justice Goldring started a new Inquest in March 2014. There is an IPCC investigation and a Police investigation into misconduct or criminal behaviour by police officers post-tragedy. Coroners Rules 1984 have been tightened up regarding consistency of classes of documents. Police Force records have been put under legislative control. Crucially, for the families and for Information Professionals, records discovery and information management delivered the truth.

Jan showed a couple of video clips during her talk. These are available from the Report pages online, but you need to scroll down to the bottom of the page:

http://hillsborough.independent.gov.uk/report/main-section/part-1/page-4/

http://hillsborough.independent.gov.uk/report/main-section/part-1/page-7/

Rob Rosset

Seek and you will find? Wednesday 18th March 2015

We had two excellent speakers for our Seminar on 18th March, entitled “Search and you will find?”: Karen Blakeman and Tony Hirst. The question mark in the title was deliberate, since the underlying message was that search and discovery might sometimes throw up the unexpected.

Learning objectives for the day were:

  • To understand the commercial, social and regulatory influences that have shaped (or will shape) Google search engine results.
  • To be able to apply new search behaviours that will improve accuracy and relevance of search results.
  • An appreciation of data mining and data discovery techniques and the risks involved in using them, as well as the education and skills required for their disciplined and ethical use

Karen Blakeman delivered an informative and thought-provoking talk about our possibly misplaced reliance on Google search results. She discussed how Google is undergoing major changes in the way it analyses our searches and presents results, which are influenced by what we’ve searched for previously and information pulled from our social media circles. She also covered how EU regulations are dictating what the likes of Google can and cannot display in their results.

Amongst many examples that Karen gave of imperfect search results, this one of Henry VIII’s wives stood out – note the image of Jane Seymour, where Google has sourced the image of the actress Jane Seymour.

Blog image re Jane Seymour

This is an obvious and easily spotted error; others are far subtler, and probably go unnoticed by the vast majority of search users. The problem, as Karen explained, is that Google does not always provide attribution for where it is sourcing its results, and where attribution is provided, the user must (or should) decide whether this is a reliable or authoritative source. Users beware if searching for medical or allergy symptoms; the sources can be arbitrary and not necessarily from authoritative medical websites. It would appear that Google’s algorithms decide what is scientific fact and what is aggregated opinion!

The clear message was to use Google as a filter to point us to likely answers to our queries, but to apply more detailed analysis of the search results before assuming the information is correct.

Karen’s slides are available at:  http://www.rba.co.uk/as/

Tony Hirst gave us an introduction to the world of data analytics and data visualisation and the challenges of extracting meaning from large datasets. Techniques such as data mining and knowledge discovery in databases (KDD) use machine learning and powerful statistics to help us discover new insights from ever-larger datasets. Tony gave us an insight into some of the analytical techniques and the risks associated with using them. In particular, if we leave decision making up to machines and the algorithms inside them, are we introducing new forms of bias that human decision makers might avoid? What do we, as practitioners, need to know in order to use these tools in a responsible way?

As Tony explained, the most effective data analysis comes down to discovering relationships and patterns that would otherwise be missed by looking at just one dataset in isolation, or analysing data in ranked lists.  Multifaceted data analysis, using – for example – datasets applied to maps, can give unique visualisations and more insightful sense making.

Amongst many other techniques, Tony discussed Concordance Correlation, Lexical Dispersion, Partial (Fuzzy) String Matching and Anscombe’s Quartet.
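As a small illustration of partial (fuzzy) string matching, here is a sketch using the `difflib` module from Python’s standard library. It is not taken from Tony’s slides; the list of names, the helper functions and the 0.8 cutoff are all choices made for this example.

```python
from difflib import SequenceMatcher, get_close_matches

# Known, correctly spelled names to match against.
names = [
    "Catherine of Aragon", "Anne Boleyn", "Jane Seymour",
    "Anne of Cleves", "Catherine Howard", "Catherine Parr",
]

def similarity(a, b):
    """Score from 0.0 (nothing in common) to 1.0 (identical), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(query, candidates, cutoff=0.8):
    """Return the closest candidate scoring at least `cutoff`, or None."""
    matches = get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_match("Jane Seymor", names))   # a misspelt query
print(best_match("zzzz", names))          # nothing close enough
```

A match on the string alone says nothing about which Jane Seymour is meant, which is exactly the kind of risk both speakers were warning about.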

Tony’s slides will be available at: http://www.slideshare.net/psychemedia

Following the keynote presentations from Karen and Tony, the following questions were put to the delegates:

  • How can organisations ensure their staff is using (external) search engines effectively?
  • How do you determine the value of search in terms of accuracy, time, and cost?
  • If I wanted to know how to use data visualisation and data analysis tools, where do I go? Who do I ask?

The delegates moved into three groups to discuss and respond to these questions (one group per question). The plenary feedback was as follows:

Group 1 – How can organisations ensure their staff is using (external) search engines effectively?

  • Ban them from using Google
  • More training
  • Employ specialists to do research
  • Use subscription services
  • Change the education system.

Group 2 – How do you determine the value of search in terms of accuracy, time, and cost?

  • Cost and Time are variable
  • Accuracy is the most important criterion
  • Differentiate between “value” and “cost”

Group 3 – If I wanted to know how to use data visualisation and data analysis tools, where do I go? Who do I ask?

Lastly, we’d like to thank our speakers and the delegates for making this such an interesting, educational and engaging seminar.

Karen Blakeman (@karenblakeman) is an independent consultant providing a wide range of organisations with training, help and advice on how to search more effectively, how to use social and collaborative tools for research, and how to assess and manage information. Prior to setting up her own company Karen worked in the pharmaceutical and healthcare industry, and for the international management consultancy group Strategic Planning Associates. Her website is at http://www.rba.co.uk/ and her blog at http://www.rba.co.uk/wordpress/

Tony Hirst (@psychemedia) is a lecturer in the Department of Computing and Communications at the Open University, where he has authored course material on Artificial Intelligence and Robotics, Information Skills, Data Analysis and Visualisation, and is a Data Storyteller with the Open Knowledge School of Data. An open data advocate and Formula One data junkie, he blogs regularly on matters relating to social network analysis, data visualisation, open education and open data policy at blog.ouseful.info

Steve Dale
20/03/15

Business Information Review is seeking a new editor

Business Information Review is seeking a new editor (or editors) to replace Val Skelton and Sandra Ward, who have job-shared the editorship, from the end of March or early April 2015. They will have completed five years of editing by then, and they think it’s time to hand over what is fun, exciting and challenging! Sage Publications would like to find replacement editor(s). Details of the post, which is remunerated, and how to apply for it can be found at: http://bir.sagepub.com/site/includefiles/BIR%20Call%20for%20Editor%28s%29.pdf

Val and Sandra are happy to answer queries about the post. Contact :

Communities of Practice for the Post Recession Environment Tuesday 16th September 2014

35 people attended this Event at the British Dental Association in Wimpole Street. Our speaker was Dion Lindsay of Dion Lindsay Consulting: http://www.linkedin.com/pub/dion-lindsay/3/832/920 . Dion tackled big questions in his presentation. Are the principles established for successful Communities of Practice (CoP’s) in the 1990’s and earlier still sound today? And what new principles and good practices are emerging as social media and other channels of communication become part of the operational infrastructure that we all inhabit? Dion started off with a couple of definitions. He explained the characteristics of CoP’s. In essence it begins with ‘practice’: practitioners who discuss and post about practical problems, and practitioners who suggest solutions and develop practice. These solutions are at the practical level. Hence, competence at individual and corporate level is increased. It continues with collaboration – the development of competence in an environment short of money! He instanced the Motor Neurone Disease Association (MNDA) where he had developed an electronic discussion board in the 1990’s. In 1998 this electronic discussion board was taken over by University College London (UCL) and became an electronic discussion forum. It had accumulated 40,0000 posts. An analysis showed that the forum’s posts split 80% moral support and 20% problem solving.

How about Communities of Interest (CoI’s)? These are all about people who share an identity. They have a shared voice and conduct a shared activity. So ‘identity’ is a critical characteristic. Also, there is an ongoing discussion about interests, an ongoing organisation of events and an interest in problems and solutions. This can take place in the workplace or in the public arena. Now to differentiate CoP’s from CoI’s: CoP’s get most attention in the workplace, whereas the most serious work of CoI’s is detached from the workplace. There is a dearth of literature on this.

Success factors for CoP’s: a successful CoP must be a physical community / a successful CoP must not have management setting the agenda / to be successful, CoP’s must have recognisable outcomes / treat CoP discussions as conversations. Taking just the recognisable outcomes aspect, it is necessary to emphasise that ‘the knowledge as it is created must be communicated’. In about 2005 Shell and MNDA (MNDA figures in brackets) reported similar findings in creating a Knowledge Base from CoP outcomes: Cost 20% (30%); Value 85% (90%). Compare these with standard Knowledge Base stats: Cost 80% (70%); Value 15% (10%). These figures speak for themselves. So we can sum up the reasons for a revival of interest in CoP’s as follows: cost pressure on training and formal means of development in the workplace / collaboration and social media are accustoming organisations to non-structured working / the need to find ways of keeping employees engaged / technology for discussion forums is less of a challenge.

Dion concluded his talk by saying that ‘you really have to want  to do it’ to run a successful CoP. There is a benefit in commencing. There must be proper facilitation. There must be adherence to best management practice. A CoP is, in reality, a ‘Community of Commitment’. It fits in very well indeed with project management.

Graham Robertson – a NetIKX ManCom Member – then gave a brief history of NetIKX going back many, many years to when it started up at Aslib. Lissi Corfield – another NetIKX ManCom Member – spoke about our current ideas at NetIKX to take things forward, as people are not coming along to meetings as frequently as they used to do. She talked about building resources in Information Management and Knowledge Management on the website, and about publicising and, indeed, interacting with our group on LinkedIn. Both Graham and Lissi are practitioners in Knowledge Management.

Under Lissi’s supervision we then broke up and started syndicate sessions at the close of which each syndicate reported back to the meeting. The main points are highlighted below.

Syndicate 1 : How to gain management support for CoP’s – the fears and successes.

  • Fear may be seen as presenting formal advice.
  • Encourage openness with no anonymity.
  • Resource of sharing policy together.
  • Each table is its own CoP.

Syndicate 2 : How do you become involved in existing CoP’s ? Should you bother ?

  • Senior actors are already connected.
  • Impose / grow organically.
  • Cross organisation / grows out of a need.
  • Can we learn from Quality Circles ?

Syndicate 3 : What is a good moderator ?

  • Challenging
  • Active/passive
  • Online/in person
  • CoP/CoI
  • Ground rules
  • FAQ’s/steering friendly discussion
  • Energy
  • LinkedIn

Syndicate 4 : Developing IM and KM resources for the NetIKX website

Valuable contributions were made by David Penfold, Martin Newman and Conrad Taylor.

Robert Rosset input suggestions of individuals and organisations from whom NetIKX had learned on the WIKI page of the website.  Rather like potter’s clay it needs to be worked into shape. An ounce of practice is worth a ton of theory.

Rob Rosset 22/09/15

Selling Taxonomies to organisations, Thursday July 3 2014

Blog for NetIKX  July 3rd 2014  Whatever happened to Margate?

The NetIKX meeting this month was highly popular.  I thought a session on Taxonomy might be considered dull, but I guess the hook was in the title: ‘making the business case for taxonomy’.  The session did provide great ideas for making a business case for an organisational taxonomy project, and the ideas were suitable for other contexts where direct quantifiable benefit will not be an output of the project and so immediate impact on ROI is not a simple computation.

There were two case studies presented.  The first, from ‘Catalogue Queen’ Alice Laird (ICAEW), faced the business case quandary head on.  How did they get hard-headed finance to budget for their taxonomy plans?  The winning move here was to show on a small scale the value of the work.  People in the business realised that the library micro-site was the best place to find things and asked why this was so.  The knowledge management team were able to demonstrate how the taxonomy could increase organisational efficiency and so helped prove the case to all website users.

This case study also provided tips for running a taxonomy project.  They used a working group from the body of the organisation, but kept the team small to ensure each person involved was clear about the relevance of the project to them and their team.  They also made the project stages clear: a consultation stage might show where there were contradictions and confusion, and so there was a following stage where the people with appropriate expertise would step in to make firm decisions.  By setting out the stages clearly, they avoided protracted discussion and also made good use of the skills already available within their team.  In this way they fully exploited their assets! All in all, it was good to hear a crisp report about a well organised project, and we all wish them luck for their imminent implementation.

The second case study looked at using a taxonomy to help share data between different organizations in the UK Heritage sector.  In a talk called ‘Reclassify the Past’, Phil Carlisle (English Heritage) entertained us, explaining a particular problem that fuelled the need for a taxonomy project.  At one point, although the classification system worked well in most respects, some vital geographic data was not included.  As a result, a search on, for example, Margate came up with a blank, even though the data was in there.  The danger was of reputation loss – particularly with people living in Margate!  Highlighting this type of blip was another useful way to sell a structured taxonomy project.  Search, even with a good search engine, is more complex than many people realise, and poorly organised metadata can cause problems that ‘Google it!’ may not solve.

This case study also provided an interesting operational tip.  In order to create the best platform for sharing, this team gave away the software they were using to others in the field, as the cost was outweighed by the overall benefit of standardisation.

The session ended with a lively set of discussions.  I was with a group trying to identify more closely how a taxonomy should be classified: animal, vegetable or mineral? We found some paradoxes to play with.  For example, does a taxonomy work as a device to structure data or is a structure already in place, the basis for the taxonomy?

To conclude, it was ironic that one of the speakers commented jokingly, ‘there’s no gratitude!’  Fair comment, as basic information infrastructure projects do not usually attract riveted attention. But, at this meeting at least, where taxonomies are loved and cared for, and business case tips are welcomed, the speakers could rely on full appreciation and gratitude from a very attentive audience.

Lissi Corfield (posted by robrosset)

Graham Robertson giving feedback on his group’s discussions

Steve Dale summarising his group’s discussions

Information on the Move Seminar Tuesday May 13th Part 2

Max Whitby of Touch Press (http://www.touchpress.com) came to talk to about 30 people attending the NetIKX seminar at the British Dental Association in Wimpole Street, following on from David Nicholas (see related blog Part 1). Max’s company specialises in creating apps which are interactive and provide information or assist in education. In other words, these apps have a point; they are not games. They have created apps of ‘The Periodic Table’, ‘The Solar System’ and ‘The Orchestra’. Users spend hours looking, listening and reading the annotation on these apps. For example, on the app for T.S. Eliot’s great poem “The Waste Land”, there are multiple readers including Fiona Shaw, Alec Guinness and T.S. Eliot. Three of their music apps have been nominated for an award from the Royal Philharmonic Society. Max displayed a couple of the apps on screen – one in particular caught my attention – ‘The Orchestra’. This features the instruments (looking at each instrument from every angle); the music (including the score); the conductor. Amazing.

Following on from Max’s talk we had refreshments and then divided up into two syndicate groups. These working groups addressed two different issues. “1) Taking an example of the rich functionality and content of the Touch Press app, think of an app that your organisation could develop that would engage and/or educate and/or inform its users/customers”. Syndicate 1 came up with five ideas. Members from the Ministry of Justice suggested an information app for internal use within the Ministry. This app could identify all the things that policy makers needed to know (to connect with) in order to produce proper policy. The current tools are paper documents, documents held by records management or information controlled by external contractors. It is a question of packaging up such tools and presenting them in a uniform but innovative way on an app. Members from the Institute of Energy suggested an educational app. On their current website is an interactive matrix demonstrating “The Energy Chain”. It is linked to an offsite database (massive)  held in a separate location. An app could have one part of the database in order to describe “The Energy Landscape” (a mixture of visual/text/statistics). It could be used by anyone: researchers, students, members of the public. Attendees from the Medical Defence Union came up with an app about things to avoid, in terms of risk mitigation for medical professionals. Another attendee from the Department of Health suggested two apps – one about how the body functions, with different levels of knowledge, so it can be used by health professionals and members of the public; the other app to address the issue of IT Support. This would cover everything to do with Service Management from issues with suppliers to logging all support calls in one place. It was believed that such apps would offer a richer experience than textbooks or documents.

Syndicate 2 dealt with the question “What is the role of the information professional in a disintermediated, information rich world?” They came up with the idea that today’s Information Professionals should go out into the market place. Information Professionals are competing with IT people who have no background or skills in information management. The talk was about trust, and about embracing the traditional skills of quality assurance and quality control so that information is trusted. Such an approach calls for advocates who are very relevant for the organisation in question. Librarians were once embedded in certain organisations (as in the pharmaceutical industry) but are not today. This syndicate’s focus was on disintermediation rather than ‘information on the go’.

Steve Dale wrapped up the syndicate sessions by stating that there was always a need to evaluate the information we receive – we can’t rely on algorithms, which can be degraded. The Syndicate Sessions ended and the attendees enjoyed a glass of wine (or two) and nibbles. It was a most successful seminar. Our thanks to NetIKX ManCom for organising the Event and in particular to Suzanne Burge, Melanie Harris, Anoja Fernando and Steve Dale for running the Event on the day.

Rob Rosset

Information on the Move – Seminar held on Tuesday May 13th 2014 – Part 1

David Nicholas came to talk to a group of about 30 NetIKX members at the headquarters of the British Dental Association in Wimpole Street. David runs CIBER, a pan-European research outfit (http://ciber-research.eu). He spoke about ‘the second digital transition’, which he predicts will mean that there are no librarians (as we know them) by 2022. ‘The first digital revolution’ brought librarianship to its knees. This one will finish it off. It is ‘the end of culture as we know it’. ‘The first digital revolution’ took place in the office or in the library. The device, the PC, was desk bound, office bound. ‘The second digital revolution’ is taking place in the street. Mobile is now the main platform for accessing the web. Mobile means meeting information needs at the time of need. Mobiles provide access to masses of information for everyone. Smartphones and social media stride major information worlds, informal and formal. Mobiles empower digital consumer purchasing. Mobiles are fast. Mobiles are smaller devices with small screens. They are not computational devices but access devices. Mobiles are social, personal, cool and popular.

Here are the basic characteristics of digital information seeking behaviour: ‘hyperactive’ – users love choice and looking; ‘bouncers’ – they view 1-2 pages from thousands; ‘promiscuous’ – about 40% don’t come back; ‘one slots’ – one visit, one page. Why is this? Because of search engine lists; massive and changing choice; so much rubbish out there; poor retrieval skills (2.2 words per query); multi-tasking (it is more pleasurable doing several things at once); and end user checking, so there are no memories in cyberspace and a very high ‘churn rate’. The horizontal has replaced the vertical: reading is ‘out’, fast ‘media’ is ‘in’. In information seeking terms, users ‘skitter’ – they power browse. The consequences? Abstracts have never been so popular; scholars go online to avoid reading and prefer the visual; visits last a few minutes, and 15 minutes is a long time; shorter articles have a much bigger chance of being used.

Europeana mobile use (http://www.europeana.eu/): 130,000 unique mobile users accessed Europeana in the last six months. Characteristics: visits from mobiles are ‘information light’ and much less interactive, with fewer records viewed, fewer searches and less time per visit; there are differences between devices (iPhone – abbreviated behaviour on the part of searchers; iPad – behaviour conforms to that of PC users); mobile use peaks at nights and weekends (desktops peak on Wednesdays and in the late afternoon); searching and reading have moved into the social space. We could not have come further from the initial concept of libraries: no walls, no queuing, no intermediaries! Ask any young person about a library and they will point to their mobile. It is ironic that mobiles were once banned from libraries – now the mobile is the library. The mobile, borderless information environment really challenges libraries and publishers. It constitutes another massive round of disintermediation and migration. The changed platform and environment transform information consumption. Final reflection: are the web and the mobile device making us stupid? Where are we going with information, learning and mobile devices?

Rob Rosset


Event report: From data and information to knowledge: the Web of tomorrow – a talk by Dr Serge Abiteboul

Some notes taken at the Milner Award Lecture by Dr Serge Abiteboul for the Royal Society on 12th November, From data and information to knowledge: the Web of tomorrow. Dr Abiteboul was awarded the 2013 Milner Award, given annually for outstanding achievement in computer science by a European researcher.

Serge Abiteboul

Dr Abiteboul’s research work focuses mainly on data, information and knowledge management, particularly on the Web. Like NetIKX members, he is interested in the transition from data to knowledge. Among many prestigious projects, he has worked on Apple’s Siri interface and Active XML, a declarative framework that harnesses web services for data integration.

In a charming French accent, he explained to us that he was going to talk about networks – networks of machines (Internet), of content (Web) and people (social media).

Nowadays information is everywhere, worldwide. Everything is big and getting bigger – the size of the digital world is estimated to be doubling every 18 months. A web search engine now is a cluster of machines – maybe a million machines. In the past getting ten machines to work together was a big challenge! Engineering achievements have enabled hundreds of thousands of computers to work together.

Dr Abiteboul’s assumptions

1. The size of the digital world will continue to grow.
2. The information will continue to be managed by many systems (rather than a company like Facebook taking over all the world’s information).
3. These systems will be intelligent – in the sense that they produce/consume knowledge and not simply raw data.

The 4 + 1 V’s of Big Data…

Volume, Velocity, Variety, Veracity = four difficulties of big data. There is a huge mass of data, more than can be retrieved. And it is changing fast, particularly sets of data like the stock market. Furthermore, the information on the web is uncertain, full of imprecisions and contradictions. Search engines must contend with lies and opinions, not just facts.

Dr Abiteboul’s +1 is Value – the bottom line is, what value comes from all this data? How does a computer decide what is important to present?

Data analysis is a technical challenge as old as computer science. We know how to do it with a small amount of data; the next challenge is to do it with a huge amount. Complex algorithms will have to be designed. These will need to do low level statistical analysis, because finding the perfect statistics will take too long. Maths, informatics, engineering and hardware are all needed.

But of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die. (Genesis 2.17)

People often prefer being given one answer rather than a multitude of options to sort through. When we ask another person a question, they don’t reply by giving us twenty pages to read through, so why should we interact with machines (search engines) like that? (Note – should information professionals be very selective with the information we put forward to customers? Would they prefer a reading list of five books rather than twenty?)

Machines prefer formatted knowledge, logical statements. Machines can be programmed to find patterns – e.g. Woody Allen ‘is married to’ Soon-Yi Previn. But people write that two people are married in many different ways. How does a search engine cope with all the false statements and contradictions, e.g. ‘Elvis Presley died on 16 August 1977’ and ‘The King is alive’?

The real problem with the accuracy of Wikipedia is not incorrect amateurs but paid professionals with their own agenda, paid by companies to take a particular viewpoint.

The difficulty is when to stop searching – when you have found just enough right answers. Precision, the fraction of results that are correct, must be balanced against the number of results retrieved. There is a trade-off between finding more knowledge and finding the correct knowledge. Machines will have to be programmed to separate the wheat from the chaff. Knowing the good sources, the trustworthy sources, is a huge advantage for this.
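
To make the trade-off concrete, here is a minimal sketch in Python. It is an illustration only, not something shown in the lecture: the document IDs are invented, and ‘recall’ (the fraction of all the correct answers that were actually found) is the standard companion measure to precision, which the talk did not name explicitly.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one set of search results.

    precision = correct results / all results returned
    recall    = correct results / all correct answers that exist
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four documents are truly relevant ("d..."),
# and the engine's ranked list mixes them with irrelevant ones ("x...").
relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "d2", "x1", "d3", "x2", "x3", "d4", "x4"]

# Stopping early: everything returned is correct, but half is missed.
narrow = precision_recall(ranked[:2], relevant)

# Returning everything: nothing is missed, but half of it is chaff.
broad = precision_recall(ranked, relevant)

print("top 2 results :", narrow)
print("all 8 results :", broad)
```

Cutting the list off after two results gives perfect precision but finds only half of the right answers; returning all eight finds every right answer but halves the precision. Deciding where to stop is exactly the balance described above.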

Serendipity

Next, Dr Abiteboul mentioned librarians! He praised the way that a librarian may suggest you read an article that transforms your research. Or you may hear by chance a song that totally obsesses you. Computers lack this serendipity – they’re square. Information professionals take heart: there is value in chance, in browsing shelves, in the ability of your brain to make suggestions computers wouldn’t.

Hyperamnesia

We cannot archive all the data we produce – there’s a lack of storage resources. How do we choose what we keep? The British Library is tackling this question through its UK Web Archive project, which involves archiving 4.8 million UK websites and one billion web pages.

The BL Web Archiving page says: “We are developing a new crowd sourcing application that will use Twitter to support an automated selection process. We envisage that in the future, automated selection of this sort will complement manual selection by subject experts, resulting in a more representative and well-rounded set of collections.” So perhaps the web of the future will need both expert people and star computing systems.

The decisions of machines

Decisions are increasingly made by machines: for instance, by automated transport systems like the Docklands Light Railway, or by automated trading on the stock market. How far do we go with this, asked Dr Abiteboul. Would a machine be allowed to decide that someone is a terrorist and kill them, and if so at what level of certainty? At 90% sure? At 95% sure?

Soon machines will acquire knowledge for us, remember knowledge for us, reason for us. We should get prepared by learning informatics, so that we understand them.

There were so many ideas flying about that I was unable to note them all down! Luckily the whole lecture is freely available to watch at www.youtube.com/watch?v=to9_Xc9f96E.

Blog post by Emily Heath.

Digital native or digital immigrant – does it matter?

Karen Blakeman and Graham Coult, 28 January 2013 #NetIKX59

The first seminar of NetIKX’s new 2013-2015 programme looked at the issues we all face in a technology-driven world.  It combined two of our key themes: harnessing the web for information and knowledge exchange, and developing and exploiting information and knowledge assets and resources.

Karen Blakeman – RBA Information Services
‘Born digital: time for a rethink’

As Karen reminded us, the phrase ‘digital immigrants’ can be traced back to Marc Prensky’s 2001 paper, ‘Digital Natives, Digital Immigrants’. This paper is free to download and there is also a follow-up Part 2 paper. Prensky made the argument that the US education system was no longer fit for purpose for a younger generation born with new technologies exploding around them.

Karen Blakeman speaking.

Pre-internet, many information professionals were using subscription databases with no graphical interfaces. Back then we did a lot of asking people we knew, or asking other professional institutes. In contrast, a wide range of innovative, imaginative search interfaces exists now:

  • ChemSpider – a free chemicals database which lets you search on a graphic, or even draw a chemical structure yourself and search on it. “Wonderful!” said Karen.
  • Mendeley – a useful specialist search engine to find specific forms of information, for instance patents, hearings, television broadcasts or computer programs.
  • WorldWideScience – pulls together information from a wide range of science websites and presents them in a visually appealing way.
  • masswerk.at/google60/ – an amusing punch card style mock-up of what Google would have looked like in the 1960s.

Karen believes that the ‘digital native’ or ‘digital immigrant’ labels are not helpful and “we have far more useful things to worry about”! Using Google effectively, producing good digital photos – none of this comes naturally to any of us – we have to learn.

The major issue for many of us is not going to be the technical side of using technology but the cost, which could lead to poorer people and those living in remote areas being excluded. Many parts of the UK still do not have broadband.

School homework is often internet based now, with students expected to carry out research online – more difficult for children who have slow internet at home or no internet access at all.

Under new government policy rules, jobseekers will soon be forced to sign up online with a job seeker’s website named Universal Jobmatch, or face losing their benefits (see this Guardian article, ‘Unemployed to be forced to use government job website’). Those without internet can use their local library – unless, of course, the library has been closed down!

The Millennials may know how to use social media, but perhaps not in a work context. We tend to have an expectation that just by using the internet regularly, the younger generation have absorbed excellent web analysis and communication skills. This is not always the case. University lecturers often report that their students lack awareness of how to assess the validity of sources and construct their own argument in an essay. Perhaps the sheer amount of information available online has resulted in too much spoon-feeding.

Ultimately Karen believes that it’s your attitude to technology that matters, not what technology you were brought up with. It’s down to personality – your level of curiosity and happiness to explore, an individual thing rather than an age thing. This is demonstrated by an interview on the BBC website with a pensioner who enjoys gaming – ‘Computer games keep me mentally active’.

Karen’s presentation is available at slideshare.net/KarenBlakeman/born-digital.

Graham Coult, Editor-in-Chief, Managing Information
‘Research behaviours: the evidence base’

In support of Karen’s talk, Graham gave us an overview of research which has been undertaken into research behaviours – “Karen was the main course, I’m the pudding”. He told us he would present a “selection, even a miscellany, not exhaustive” of relevant research, taken from Emerald and ASLIB’s database of research articles.

‘Social media at the university: a demographic comparison’. Alice B. Ruleman, University of Central Missouri, US (2012)

In this study, Ruleman analysed the demographic differences between faculty staff and students in terms of their social media use. She found that social media is by no means a youthful obsession, with both staff and students being active users of social media, just in different ways.

Graham Coult speaking.

Kilian, T., Hennigs, N. and Langner, S. (2012), “Do Millennials read books or blogs? Introducing a media usage typology of the internet generation”, Journal of Consumer Marketing, Vol. 29, No. 2, pp.114 – 124. ISSN: 0736-3761.

The authors of this study sought to add to the relatively small amount of empirical research done so far on the social media use of the “Internet Generation”. They found that although social media use amongst Millennials is generally high, Millennials as a group are not homogeneous in their online behaviour. Using a large-scale empirical study with over 800 participants, the authors identified three different subgroups of Millennials:

  • ‘Restrained’ – relatively low tech savvy, low social media usage group
  • ‘Entertainment seeking’ – the biggest group. Using social media for entertainment, but consuming passively, rather than creating new content themselves.
  • ‘Highly connected’ – the smallest group, predominantly male, busy creating content such as blogs or videos, leading a very active digital life.

Perhaps surprisingly, ‘information seeking’ was the main reason the surveyed Millennials gave for using social media. Facebook are planning to enhance their search capabilities through their new Facebook Search service. Who needs Google+ or indeed Google if Facebook does search? This could create a situation where large groups of Facebook users never search outside Facebook.

Vandi, C. and Djebbari, E. (2011), “How to create new services between library resources, museum exhibitions and virtual collections”, Library Hi Tech News, Vol. 28 No. 2, pp.15 – 19. ISSN: 0741-9058.

This paper discusses lots of ways to link up traditional sources using mobile technologies. There is evidence that new technologies (mobile etc) can increase use of “traditional” library services in unforeseen ways.

Graham’s conclusions:

  • There is still a great need for a trusted intermediary such as an experienced information professional. This need has probably increased rather than reduced.
  • Lack of access to technology, and lack of skill in its use, will increase disadvantages for certain user groups.
  • Editing and curating, picking out the best quality information, is likely to become a sought-after skill as information overload increases.

Graham’s presentation is available to NetIKX members at www.netikx.org.

By Emily Heath