
SameAs4J: little drops of water make the mighty ocean

July 5, 2010
A few days ago Milan Stankovich contacted the Sindice crew to let us know that he had written a simple Java library for interacting with the public Sindice HTTP APIs. We always appreciate this kind of community effort aimed at collaboratively making Sindice a better place on the Web. Agreeing with Milan, we decided to put some effort into his initial work and make the library the official open source tool for Java programmers.

FBK, Any23 and my involvement in Sindice.com

February 20, 2010

After almost two years working at Asemantics, I have left to join the Fondazione Bruno Kessler (FBK), a rather large research institute based in Trento.

These last two years have been amazing: I met very skilled and enthusiastic people and worked with them on a broad set of technologies. Every day spent there was an opportunity to learn something new, and by now they are good friends more than colleagues. Asemantics, meanwhile, has become part of the bigger Pro-netics Group.

Having moved from Rome, I decided to follow Giovanni Tummarello and Michele Mostarda and launch from scratch a new research unit at FBK called “Web of Data”. FBK is a well-established organization with several units working on a plethora of different research fields, and every day brings an opportunity to join workshops and other kinds of events.

Just to give you an idea of how things work here: in April 2009 David Orban gave a talk on “The Open Internet of Things” attended by a large number of researchers and students. Besides FBK, Trento hosts a quite active community hanging out around the Semantic Web.

“The Semantic Valley”: that’s what they call this euphoric movement around these technologies.

Back to me: the new “Web of Data” unit has joined the Sindice.com army, and the freshly released Any23 0.2 is only the first outcome of this joint effort on the Semantic Web Index between DERI and FBK.

In particular, the Any23 0.2 release has been my first task here. It’s a library, a service, an RDF distiller. It’s used on board the Sindice ingestion pipeline, it’s publicly available here, and yesterday I spent a couple of minutes writing this simple bookmarklet:

javascript:window.open('http://any23.org/best/' + window.location);

Once in your browser’s toolbar, pressing it on any Web page invokes the Any23 servlet and returns the bunch of RDF triples distilled from that page.
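
For example (the target page is just an illustration), clicking it while viewing http://www.w3.org/People/Berners-Lee/card would open http://any23.org/best/http://www.w3.org/People/Berners-Lee/card in a new window, showing the triples distilled from that FOAF profile.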

So, what’s next?

The Web of Data unit has just started. More things, from the next release of Sindice.com to other projects currently in inception, will see the light. I really hope to keep contributing to the concrete consolidation of the Semantic Web, the Web of Data or Web 3.0 or whatever we’d like to call it.


Cheap Linked Data identifiers

December 26, 2009

This is a (short) technical post.

Every day I face the problem of getting Linked Data URIs that uniquely identify a “thing” starting from an ambiguous, poor, flat keyword or description. One of the first steps in developing an application that consumes Linked Data is to provide a mechanism for linking your own data sets to one (or more) LoD bubbles. To get a clear idea of why identifiers matter, I suggest reading this note from Dan Brickley: starting from some needs we encountered within the NoTube project, he clearly underlines the importance of LoD identifiers. Even if the problem of uniquely identifying words and terms falls into the bigger category usually known as term disambiguation, I’d like to clarify that what I’m going to explain here is a narrow restriction of the whole problem.

What I really need is a simple mechanism that converts one specific type of identifier into a set of Linked Data URIs.

For example, I need something that, given a book’s ISBN number, returns a set of URIs referring to that book. Or, given the title of a movie, I expect back some URIs (from DBpedia, LinkedMDB or whatever) identifying and describing it in a unique way.

Isn’t SPARQL enough for you to do that?

Yes, obviously a SPARQL query along the lines of the sketch below (written here against DBpedia’s dbpedia-owl:isbn property) may be sufficient:
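
SELECT DISTINCT ?subject
WHERE {
    ?subject dbpedia-owl:isbn ?isbn .
    FILTER (regex(?isbn, "978-0-374-16527-7"))
}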

but what I need is something quicker, which I can invoke as an HTTP GET like:

http://localhost:8080/resolver?value=978-0-374-16527-7&category=isbn

which returns a simple JSON document:

{ "mappings": [
"http://dbpedia.org/resource/Gomorrah_%28book%29"],
"status": "ok"
}

But the real issue here is the code overhead needed to add other kinds of identifier resolution. Imagine, for instance, that I have already implemented this kind of service and I want to add another resolution category. I would have to hard-code another SPARQL query, modify the code to expose it as a service, and redeploy the whole thing.

I’m sure we could do better.

If we take a closer look at the above SPARQL query, we easily figure out that the problem can be highly generalized. In fact, this kind of resolution often means performing a SPARQL query asking for URIs that have a certain value for a certain property, such as dbpedia-owl:isbn in the ISBN case.

And this is what I built over the last two days: the NoTube Identity Resolver.

A simple Web service (depicted in the figure below), fully customizable by simply editing an XML configuration file.

NoTube Identity Resolver architecture

The resolvers.xml file allows you to provide a simple description of the resolution policy that will be accessible with a simple HTTP GET call.

Back to the ISBN example, the following piece of XML is enough to describe the resolver:

<resolver id="2" type="normal">
<category>isbn</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<lookup>dbpedia-owl:isbn</lookup>
<sameas>true</sameas>
<matching>LITERAL</matching>
</resolver>

Where:

  • category is the value that has to be passed as the category parameter of the HTTP GET call to invoke this resolver
  • endpoint is the address of the SPARQL endpoint where the resolution is performed
  • lookup is the name of the property whose value is looked up during the resolution
  • type (optional) is the rdf:type of the resources to be resolved
  • sameas is a boolean enabling (or not) a call to the SameAs.org service to gather equivalent URIs
  • matching (allowing only URI and LITERAL as values) describes the type of the value to be resolved

Moreover, the NoTube Identity Resolver also gives you the possibility to specify more complex resolution policies through a custom SPARQL query, as shown below:

<resolver id="3" type="custom">
<category>movie</category>
<endpoint>http://dbpedia.org/sparql</endpoint>
<sparql><![CDATA[SELECT DISTINCT ?subject
WHERE { ?subject a <http://dbpedia.org/ontology/Film>.
?subject <http://dbpedia.org/property/title> ?title.
FILTER (regex(?title, "#VALUE#")) }]]>
</sparql>
<sameas>true</sameas>
</resolver>
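
With such a resolver in place, resolving a movie title boils down to a GET call like the following (the title is of course just an example), with the #VALUE# placeholder in the query standing in for the value passed in the call:

http://localhost:8080/notube-identity-resolver/resolver?value=The%20Godfather&category=movie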

In other words, every resolver described in the resolvers.xml file enables one kind of resolution mechanism without writing a single line of Java code.

Do you want to try?

Just download the war package, get this resolvers.xml (or write your own), export the RESOLVERS_XML_LOCATION environment variable pointing to the folder where resolvers.xml is located, deploy the war on your Apache Tomcat application server, start the application and try it out by heading your browser to:

http://localhost:8080/notube-identity-resolver/resolver?value=978-0-374-16527-7&category=isbn
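
Or, for the Java programmers out there, with a minimal client sketch like this one (class and method names are mine, and the response is kept as a raw string to avoid any JSON library dependency):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

/** Minimal, illustrative client for the NoTube Identity Resolver. */
public class ResolverClient {

    private static final String SERVICE =
            "http://localhost:8080/notube-identity-resolver/resolver";

    /** Performs the HTTP GET described above and returns the raw JSON response. */
    public static String resolve(String value, String category) throws Exception {
        // Build the query string, URL-encoding both parameters.
        String query = "?value=" + URLEncoder.encode(value, "UTF-8")
                     + "&category=" + URLEncoder.encode(category, "UTF-8");
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(SERVICE + query).openStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line);
        }
        in.close();
        return json.toString(); // e.g. { "mappings": [...], "status": "ok" }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(resolve("978-0-374-16527-7", "isbn"));
    }
}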

That’s all folks


RWW 2009 Top 10 Semantic Web products: one year later…

December 13, 2009


Just a few days ago the popular ReadWriteWeb published its list of the 2009 Top Ten Semantic Web products, as it did one year ago with the 2008 Top Ten.

These two milestones are a good opportunity to draw something like a balance sheet, or simply to take a quick look at what has changed in the “Web of Data” only one year later.

The 2008 Top Ten featured the following applications, listed here in the same ReadWriteWeb order and enriched with some personal opinions.

Yahoo Search Monkey

It’s great. Search Monkey represents the first of a new generation of search engines thanks to its capability to be fully customized by third-party developers. Recently, breaking news woke up the “sem webbers” of the whole planet: Yahoo started to show structured data exposed with RDFa in its search results pages. The news bounced all over the Web, and those interested in SEO started to appreciate Semantic Web technologies for their business. But, unfortunately, at the moment I’m writing, RDFa is no longer shown in search results due to a layout update that broke this functionality. Even if there are rumors of an imminent fix, the main problem is the robustness and reliability of this kind of service: investors need to be properly reassured about the effectiveness of their investments.

Powerset

Probably, this neat application became really popular when it was acquired by Microsoft. It allows you to make simple natural language queries like “film where Kevin Spacey acted” and, at first glance, the results seem much better than those of traditional search engines. Honestly, I don’t really know which technologies they use to do this magic. But it would be nice to compare their results with a hypothetical service that translates such human text queries into a set of SPARQL queries over DBpedia. Anyone interested in doing that? I’d be more than happy to be engaged in a project like that.

Open Calais

With a large and massive branding operation, these guys built the image of this service as if it were the only one fitting everyone’s needs when dealing with the semantic enrichment of unstructured free text. Even if this is partly true (why not mention the Apache UIMA OpenCalais annotator?), there are a lot of other interesting services that are, in certain respects, more intriguing than the Reuters one. Don’t believe me? Give AlchemyAPI a try.

Dapper

I have to admit my ignorance here: I had never heard of it, but it looks very, very interesting. This service, which mainly offers some sort of semantic advertising, is certainly more than promising. I’ll keep an eye on it.

Hakia

Down at the moment I’m writing. 😦

Tripit

Many friends of mine are using it, and that alone could be enough to make it popular. Again, I don’t know whether they use any of the W3C Semantic Web technologies to model their data. RDF or not, this is a neat example of a semantic web application with good potential: is that enough for you?

BooRah

Another case of personal ignorance. This magic is, mainly, a restaurant review site. BooRah uses semantic analysis and natural language processing to aggregate reviews from food blogs. Because of this, BooRah can recognize praise and criticism in these reviews and rate restaurants accordingly. One criticism? The underlying data are perhaps not so rich: it sounds impossible to me that searching for “Pizza in Italy” returns nothing.

Blue Organizer (or GetGlue?)

It’s not a secret that I consider Glue one of the most innovative and intriguing things on the Web. When it appeared in the ReadWriteWeb Top 10 Semantic Web applications, it was far from what it is now. Just one year later, GetGlue (Blue Organizer seems to be the former name) appears as a growing, lively community of people who have realized how important it is to browse the Web with the aid of a tool that acts as a content cross-recommender. Moreover, GetGlue provides a neat set of Web APIs that I’m using widely within the NoTube project.

Zemanta

A clear idea, powerful branding and a well-designed set of services accessible via Web APIs make Zemanta one of the most successful products on the stage. Do I have to say anything more? If you like Zemanta, I suggest you also keep an eye on Loomp, a nice piece of work presented at the European Semantic Technology Conference 2009.

UpTake.com

Mainly, a semantic search engine over a huge database containing more than 400,000 hotels in the US. Where’s the semantics there? UpTake.com crawls those records and semantically extracts the information implicitly hidden in them. A good example of how innovative technologies can be applied to well-known application domains such as hotel searching.

One year later…

Indubitably, 2009 has been ruled by the Linked Data Initiative, as I love to call it. Officially, Linked Data is about “using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods” and, if we look at its growth rate, it could be easy to bet on its success.

Here is the 2009 top ten, where I have omitted GetGlue, Zemanta and OpenCalais since they already appeared in the 2008 edition:

Google Search Options and Rich Snippets

When this new Google feature was announced, the whole Semantic Web community realized that something very powerful had started to move. Google Rich Snippets use the RDFa contained in HTML Web pages to power the rich-snippet feature of the results page.

Feedly

It’s a very, very nice feed aggregator built upon Google Reader, Twitter and FriendFeed. It’s easy to use, nice and really useful (well, at least it seems so to me) but, unfortunately, I cannot see where the semantic aspect is here.

Apture

This JavaScript goodness allows publishers to add contextual information to links via pop-ups which display when users hover over or click on them. Watching HTML pages built with the aid of this tool, Apture closely reminds me of the WordPress Snap-Shot plugin. But Apture seems richer than Snap-Shot, since it allows publishers to directly add the links and other material they want to display when the pages are rendered.

BBC Semantic Music Project

Built upon Musicbrainz.org (one of the most representative Linked Data clouds), it’s a very remarkable initiative. Personally, I’m using it within the NoTube project to disambiguate Last.fm bands. Concretely, given a certain Last.fm band identifier, I query BBC /music, which returns a URI. With this URI I ask the sameas.org service to give me other URIs referring to the same band. In this way I can associate with every Last.fm band a set of Linked Data URIs from which to obtain a full flavor of coherent data about them.
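
For example (a sketch from memory: the MusicBrainz id below is Coldplay’s, and the /json endpoint path is how I recall it), the sameas.org lookup is a single HTTP GET:

http://sameas.org/json?uri=http%3A%2F%2Fwww.bbc.co.uk%2Fmusic%2Fartists%2Fcc197bad-dc9c-440d-a5b5-d52ba2e14234%23artist

which returns a JSON bundle of equivalent URIs (DBpedia, MusicBrainz and friends) for the same band.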

Freebase

It’s an open, semantically marked-up shared database powered by Metaweb.com, a great company based in San Francisco. Its popularity is growing fast, as ReadWriteWeb has already noticed. Somehow similar to Wikipedia, Freebase provides all the mechanisms necessary to syndicate its data in a machine-readable form, mainly RDF. Moreover, other Linked Data clouds have started to add owl:sameAs links to Freebase: do I have to add anything else?

Dbpedia

DBpedia is the nucleus of the Web of Data. The only thing I’d like to add is: it deserves a place in the ReadWriteWeb 2009 top ten more than any of the others.

Data.gov

It’s a remarkable US government initiative to “increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. It’s a start, and I dream of seeing something like this here in Italy too.

So what’s up in the end?

It’s my opinion that 2009 has been the year of Linked Data. New clouds are born every month, new links between existing ones are established, and a new breed of developers is becoming aware of the potential and the threats of Linked Data consuming applications. It seems that the Web of Data is finally taking shape, even if something strange is still in the air. First of all, taking a closer look at the ReadWriteWeb 2009 Top Ten, I have to underline that 3 products out of 10 were already in the 2008 chart. Maybe the popular blog wanted to stress the progress these products have made, but it sounds a bit strange to me that they forgot nice products such as FreeMix, AlchemyAPI, Sindice, OpenLink Virtuoso and BestBuy.com’s usage of the GoodRelations ontology. Secondly, 3 products listed in the 2009 chart are publicly funded initiatives: even if this is reasonable given the nature of the products, it leaves me with the impression that private investors are not in the loop yet.

What do I expect from 2010, then?

A large and massive rush to RDFa for SEO purposes, a sustained growth of the Linked Data clouds and, I really hope, the rise of a new application paradigm grounded in the consumption of such interlinked data.


Italian political activism and the Semantic Web

September 20, 2009

Beppe Grillo


A couple of years ago, during his live show, the popular Italian blogger and activist Beppe Grillo gave a quick demonstration of how the Web concretely realizes the “six degrees of separation”. The Italian blogger, today a Web enthusiast, showed that it was possible for him to get in contact with someone very famous using a couple of different websites: IMDb, Wikipedia and a few others. Starting from a movie in which he had acted, he could reach the movie’s producer; the producer was in contact with another actor thanks to a previous work together, and so on.

The demonstration consisted of a series of links, opened one after another, leading to Web pages containing the information from which to extract the chain of relationships the showman wanted to build.

This gig came back to my mind while I was thinking about how what I call the “Linked Data philosophy” is impacting the traditional Web, and I imagined what Beppe Grillo could show nowadays.

Just the following simple, trivial and short SPARQL query:

construct {
    ?actor1 foaf:knows ?actor2
}
where {
    ?movie dbpprop:starring ?actor1 .
    ?movie dbpprop:starring ?actor2 .
    ?movie a dbpedia-owl:Film .
    FILTER(?actor1 = <http://dbpedia.org/resource/Beppe_Grillo>)
    # keep out the trivial "Beppe knows Beppe" solution
    FILTER(?actor1 != ?actor2)
}
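
You can try it against the public DBpedia endpoint at http://dbpedia.org/sparql (where common prefixes such as foaf: and dbpprop: should already be defined) and get back the corresponding foaf:knows triples.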

Although Beppe is a great comedian, it may be hard even for him to make people laugh with this. But the point here is not about laughs, it’s about data: in this sense, the Web of Data is providing an outstanding and extremely powerful way to access an incredible twine of machine-readable interlinked data.

Recently, another nice and remarkable Italian initiative appeared on the Web: OpenParlamento.it. It’s mainly a service where the Italian congressmen are displayed and positioned on a chart based on the similarity of their votes on law proposals.

OK, cool. But how could the Semantic Web improve this?

First of all, it would be very straightforward to provide a SPARQL endpoint exposing some good RDF for this data, like the following example:

<!-- namespace URIs for openp are illustrative -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:openp="http://openparlamento.it/ontology/">
    <rdf:Description rdf:about="http://openparlamento.it/senate/Mario_Rossi">
        <rdf:type rdf:resource="http://openparlamento.it/ontology/Congressman"/>
        <foaf:name>Mario Rossi</foaf:name>
        <foaf:gender>male</foaf:gender>
        <openp:politicalGroup
            rdf:resource="http://openparlamento.it/groups/Democratic_Party"/>
        <owl:sameAs rdf:resource="http://dbpedia.org/resource/Mario_Rossi"/>
    </rdf:Description>
</rdf:RDF>

where names, descriptions, political affiliation and more are provided. Moreover, a property called openp:similarity could be used to link the closest congressmen, using the same information as the already cited chart; see the sketch below.
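
Purely as a hypothetical fragment (the colleague’s resource name is made up), such a link inside the description above might read:

<openp:similarity rdf:resource="http://openparlamento.it/senate/Giuseppe_Verdi"/>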

Secondly, all the information about congressmen is published on the official web sites of the Italian chambers. By wrapping this data, OpenParlamento.it could provide an extremely exhaustive set of official information and, more importantly, links to DBpedia would be the key to getting a full set of machine-processable data from other Linked Data clouds as well.

How to benefit from all of this? Apart from employing a cutting-edge technology to syndicate data, everyone who wants to link the data provided by OpenParlamento.it on their own web pages can easily do so using RDFa, as in the sketch below.
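
A minimal fragment of an HTML page about the above (still hypothetical) congressman, marked up with RDFa, might look like this:

<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
     about="http://openparlamento.it/senate/Mario_Rossi" typeof="foaf:Person">
    Today <span property="foaf:name">Mario Rossi</span> presented
    a new law proposal to the Senate.
</div>

Any RDFa-aware crawler hitting that news item would link it straight back to the OpenParlamento.it cloud.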

With these technologies as a basis, a new breed of applications (web crawlers, for instance, for those interested in SEO) will access and process this data in a new, fashionable and extremely powerful way.

It’s time for those guys to embrace the Semantic Web, isn’t it?