Monday, January 05, 2015

Find Orphan pages on website

You have a website and want to be sure that there are no Orphan pages? Not sure how to develop it? If so - you found a right place to go :), let's talk about most important steps.
I assume you have a list of all pages you want to verify (otherwise - it will be your first task)

Here is a logic/snippets
  1. We need functionality that can extract HTML from a web page. Later we will scan it and get internal links.
    private String getPageContent(String pageurl) throws Exception {
       StringBuffer buf = new StringBuffer();
       URL url = new URL(pageurl);
       InputStream is = url.openConnection().getInputStream();
       BufferedReader reader = new BufferedReader( new InputStreamReader( is )  );
       String line = null;
       while( ( line = reader.readLine() ) != null )  {
          buf.append(line);
       }
       reader.close();  
       return buf.toString();
    }
  2. Make a logic that can deal with DOM. I use jsoup to manipulate with HTML and I really recommend it (easy and fast). The method below select all links that begin with baseurl (it's domain of your website), in that way we can cut all external links and get only internal links.
    private List<string> getAllInernalLinks(String html, String baseurl) throws Exception {
       List<string> res = new ArrayList<string>();
       String select = "a[href^="+baseurl+"]";
       org.jsoup.nodes.Document dom = Jsoup.parse(html);
       Elements links = dom.select(select);
    
       for (Element link : links) {
          String src = link.attr("href");
          res.add(src);
       }
       return res;
    }
  3. Now we must build a List with all internal links from all pages on your website.
    String List<string> alllinks = getAllInernalLinks(html_from_all_pages, baseurl);
  4. We need to make sure that pageurl can be found in alllinks more then in pagelinks (to avoid case when page has link to itself).
    private boolean isOrphan(List<string> pagelinks, List<string> alllinks, String url) throws Exception {
       if (Collections.frequency(alllinks, url) > Collections.frequency(pagelinks, url)) {
          return false;
       }
       return true;
    }

No comments :