Tag Archives: wiki

Web 2.0 and Databases, can the two worlds meet?

(cross-post Experiment, Three)

A few weeks ago, I had an interesting conversation with Paolo on why Web 2.0 tools are still struggling to find their way into the academic world. Back in September last year I attended the panel What Web 2.0 Has To Do With Databases?, which investigated why the database community has been left behind in Web 2.0 research.

Following Paolo’s suggestion, I am posting the notes I took at the time. I am well aware that the two topics are different, but I think they are related: the people who consider blogs, wikis, etc. a “waste of time” are also the ones missing the opportunity to do research in such an interesting field.

  • Sihem Amer-Yahia (Yahoo!)
  • Alon Halevy (Google)
  • AnHai Doan (University of Wisconsin)
  • Gerhard Weikum (Max-Planck Institute for Informatics, Germany)
  • Gustavo Alonso (ETH, Zurich)

The abstract can be found here.
Here is Alon Halevy’s post on the panel: read, in particular, these two comments (1, 2), which, in my opinion, summarise the situation quite well.
Is the database community ready to accept the new challenges coming from the Web 2.0 world? The risk of “missing the train” is very high, considering that commercial interest in these technologies is leaving academic research behind.


  • Web 2.0 is about people, unstructured data, imprecise queries, information retrieval.
  • Web 2.0 is not about structure and quality.

Unstructured data and applications are everywhere, and companies exploit them heavily, but:

  • A “holistic approach” is lacking (all current solutions are ad-hoc solutions)
  • The “structured methodology”, typical of the database community, should be brought into the Web 2.0.

Database people were not fully convinced by Web 2.0, and the two worlds seemed quite distant. In general, they do not believe that databases as we know them (their structure, methodologies, best practices, etc.) will ever lose their centrality in any information management application. To them, even Web 2.0 is only a “cool application” that will eventually be replaced by something else, whereas databases will still be in place.

This is quite a conservative point of view, and even those who say that “traditional DBMSs are dead” (Michael Stonebraker among others, but he’s not the only one) seem, in practice, to be a bit sceptical about databases losing their centrality.

Everybody seemed to agree that tight schema integration is a buzzword that does not work in the real world, despite the fact that it has been studied for several years in both industry and academia.

Web 2.0 seems a good compromise for achieving “real” integration, though this happens at the data level (and should probably be called “data reconciliation” instead). From the schema point of view, someone argued that real integration is not possible because there are no strong stakeholders demanding it (these will be neither the people on the street nor Google or Yahoo).

Google pushes forward the concept of a dataspace (btw, Halevy’s dataspace) that includes all users’ data. The physical system is left in the background, almost a legacy from the past: data matters, databases are needed for storage, reliability, etc. (are we talking about cloud computing?).

Someone’s comment: companies are keen on groups that do research on Web 2.0 and even encourage them to do it. However, Web 2.0 is about people and data: if the big companies do not release the data they have, how can the DB community do research on it (and what should they analyse)?

The two worlds seemed very distant, and the main reason probably lies in their different backgrounds: databases are about structure, methodology and algorithms. Web 2.0 is based on randomness (well, some form of it), no predefined schema and, above all, unpredictable social interactions that are kept away from databases. It is no surprise that communication between the two is particularly difficult.

Web 2.0 in the Enterprise

I have just read (via Paolo) the results of a degree thesis on the penetration of wikis and other “Web 2.0” tools in companies in several countries, European and otherwise.

The topic intrigues me because the company I work for is replacing Lotus Notes with Confluence to manage its internal knowledge base. The transition from the “old” to the “new” has just finished, and I have not yet formed an opinion on how useful a wiki-style tool is in a corporate setting. Just two comments, one positive, the second a little less so.

The internal search engine seems to meet expectations. I am not talking about efficiency, which we will only be able to judge if and when the old KB is loaded into the wiki, but about flexibility and the ability to run full-text searches across its pages, including documents. This feature is essential if a database is to become, over time, a source of information.

The tagging system. In theory, tagging is one of the most touted and important features of wikis, the one that makes it possible to create “semantic”, or “interest”, links between users, documents, information and so on. In practice it can become a double-edged sword and, if used incorrectly, can create several problems:

  • Excessive fragmentation (too many tags relative to the number of documents/pages)
  • Overloading of the most common tags (a few tags used by most of the pages)
  • Emergence of “rare information” because it is badly tagged (some excellent pages are never found).

In short, my first, gut-level judgement is not negative, but not completely positive either. Perhaps my doubts have more to do with my unfamiliarity with the technology than with any flaw in the product, but I will only find that out in time.