How to collect feed data from sites and blogs

Almost every site or blog can be read online or through its RSS feed. An RSS feed is a simple XML document that contains information about the blog and all of its entries. This information is valuable because it can be processed in different ways, for example for text analysis. This guide shows how to harvest feed data from The Economist, but it can be applied to any site that has an RSS feed section. Let's go through the steps.


Go to www.economist.com/rss and choose a topic. Once you click one of the topics, an XML page is shown. Copy the URL of that page into a text file; let's call it urls.txt. Choose another topic and paste its URL into urls.txt as well, one URL per line. At the end you will have two or more URLs in your urls.txt file.
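Before wiring up the parser, it can help to check that the URL list reads back cleanly. The following is a minimal sketch, not part of the original project: the class name UrlListReader and the method readUrls are made up for this illustration, and the URLs in main are placeholders rather than real feed addresses. It assumes one URL per line and skips blank lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named for this illustration only.
class UrlListReader {

    // Reads one URL per line, trimming whitespace and skipping blank lines.
    static List<String> readUrls(Reader source) throws IOException {
        List<String> urls = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    urls.add(trimmed);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URLs; replace them with the ones copied from the site.
        String sample = "http://example.com/feed-one/rss.xml\n"
                + "\n"
                + "  http://example.com/feed-two/rss.xml  \n";
        for (String url : readUrls(new StringReader(sample))) {
            System.out.println(url);
        }
    }
}
```

To read the real file, pass a java.io.FileReader for urls.txt instead of the StringReader.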


Now you have to prepare the Java client that connects to each of the URLs you saved in your text file, processes the data of the feed entries (for example title, date and contents) and writes it to an output file. Parsing the XML of a feed by hand is neither quick nor easy. That's why we will use a Java library that does this task for us. It is called feed4j and can be found here: http://www.sauronsoftware.it/projects/feed4j/download.php
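To give a feel for what parsing by hand involves, here is a minimal sketch that pulls the item titles out of an RSS 2.0 document using only the DOM parser that ships with the JDK. The class name RssByHand, the method itemTitles and the sample feed are made up for this illustration and have nothing to do with feed4j.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class, named for this illustration only.
class RssByHand {

    // Returns the text of the <title> element of every <item> in the feed.
    static List<String> itemTitles(String rssXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));

        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                titles.add(titleNodes.item(0).getTextContent().trim());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // A tiny hand-written feed, not real data.
        String sample =
                "<rss version=\"2.0\"><channel>"
              + "<title>Sample blog</title>"
              + "<item><title>First post</title><description>Hello</description></item>"
              + "<item><title>Second post</title><description>World</description></item>"
              + "</channel></rss>";
        System.out.println(itemTitles(sample));
    }
}
```

Even this small example covers only titles in one feed layout. Descriptions, dates and malformed documents each add more cases, which is the bookkeeping a feed library takes off your hands.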

Download the feed4j zip file and open it. The files you need for your project are: feed4j.jar, dom4j.jar, xml-apis.jar, xercesImpl.jar and nekohtml.jar. Create a Java project in Eclipse and call it TestFeedParser. Before you start coding the parser, add these jar files to the build path of your Java project. Let's go on with building the Java client.

Create a Java class. Let's call it TestFP. To build the client we need to go through these steps. 1 - Read the URLs we have in the text file and use each URL to fetch the XML page to be parsed. 2 - Parse this data properly and write it to an output file.

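So let's write a main method first, and create a BufferedReader for reading the URLs and a PrintWriter for writing the feed data, just like in the following code snippet. The snippet reads the URL list from data/economist.txt; point it at wherever you saved your urls.txt file.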

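    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.net.URL;
    import it.sauronsoftware.feed4j.FeedParser;
    import it.sauronsoftware.feed4j.bean.Feed;
    import it.sauronsoftware.feed4j.bean.FeedHeader;
    import it.sauronsoftware.feed4j.bean.FeedItem;

    public class TestFP {

        public static void main(String[] args) throws Exception {
            // Reader for the list of feed URLs, one per line
            BufferedReader br = new BufferedReader(new FileReader(new File("data/economist.txt")));
            int k = 0; // counts the feed items processed
            String line;
            String titolo = null;      // item title
            String descrizione = null; // item description

            // Writer for the harvested feed data
            FileWriter fileWritter = new FileWriter(new File("feedatafromeconomist.txt"));
            PrintWriter out = new PrintWriter(fileWritter);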

Then let's iterate over the URLs in the text file one at a time (while loop) and parse each response with feed4j. Each feed item has a title, a description of the content and comments. There are many other fields, but for this guide we consider these three. The code below does just that.


            while ((line = br.readLine()) != null) {
                try {
                    // Build the URL inside the try so a malformed line is skipped too
                    URL url = new URL(line.trim());
                    Feed feed = FeedParser.parse(url);

                    System.out.println("** HEADER **");
                    FeedHeader header = feed.getHeader();
                    String blog = header.getTitle();
                    System.out.println("Blog title: " + blog);
                    out.println("");
                    out.println(blog);

                    System.out.println("** ITEMS **");
                    int items = feed.getItemCount();
                    for (int i = 0; i < items; i++) { // iterate over the feed items
                        FeedItem item = feed.getItem(i);
                        System.out.println("Title: " + item.getTitle());

                        titolo = item.getTitle();
                        descrizione = item.getDescriptionAsText();

                        out.println(titolo);
                        out.println(descrizione);
                        out.println(item.getComments());
                        k++;
                    }
                } catch (Exception e) {
                    // A feed that cannot be fetched or parsed is reported and skipped
                    System.err.println("Skipping " + line + ": " + e);
                }
                out.println("");
            }
            br.close();  // close the read stream
            out.close(); // close the write stream
        } // close main method
    } // close class

3 - Run and check the results: simply run the Java code in Eclipse. When it finishes, the code will have created a file called feedatafromeconomist.txt containing the RSS feed data of the articles. You can modify the code to print other fields of the Feed object.
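The listing writes one field per line. If you would rather have one comma-separated row per feed item, a small helper along the following lines could be used. This is only a sketch: CsvRow, escape and row are hypothetical names that are not part of feed4j or of the original project, and the quoting follows the common CSV convention of wrapping a field in double quotes and doubling any quotes inside it.

```java
// Hypothetical helper, named for this illustration only.
class CsvRow {

    // Quotes a field when it contains a comma, a double quote or a line break.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Joins the escaped fields with commas into a single CSV row.
    static String row(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up values standing in for a title, a description and a comments link.
        System.out.println(row("Rates, again", "The bank said \"wait\"", null));
    }
}
```

Inside the item loop, the three out.println calls could then be replaced by a single call such as out.println(CsvRow.row(titolo, descrizione, String.valueOf(item.getComments()))).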

If you missed something while putting the pieces together, you can download the working project from GitHub or import it into Eclipse.

In the analyzing and visualizing sections you will find interesting ideas on how to analyse this data and visualize it to get insights and useful information.
