Information Extraction from Blogs

Marie-Francine Moens
Handbook of Research on Web Log Analysis

This chapter introduces information extraction from blog texts. It argues that the classical techniques for information extraction that are commonly used for mining well-formed texts lose some of their validity in the context of blogs. This finding is demonstrated by considering each step in the information extraction process and by illustrating this problem in different applications. In order to tackle the problem of mining content from blogs, algorithms are developed that combine different sources of evidence in the most flexible way. The chapter concludes with ideas for future research.

This chapter is organized as follows. The next section gives background on information extraction in general and on information extraction from blogs in particular, and outlines the history of information extraction. A subsequent section considers the different steps in an information extraction task and focuses on particular issues that arise when dealing with blog data: tokenization and lexical analysis, natural language processing, and finally information extraction itself. The latter part of the chapter goes deeper into a few specific applications: topic and thread detection, opinion mining, and argumentation detection. Wherever possible, we illustrate our findings with our own research experiences. We conclude with a number of prospects for further research.
