On October 12 I suddenly got interested in Ruby, picked up a copy of Programming Ruby 1.9, and was won over. For a long time I have felt that none of the languages I know are quite good enough, and I have always wanted to design a language of my own. After reading about Ruby, I found that its syntax and semantics are exactly what I had in mind. Its mixin mechanism in particular is a feature I had been looking for for a long time; Ruby has it, and implements it well, which is exciting. The only difference is that the language I want to design is statically typed, while Ruby is dynamically typed. Once you introduce type declarations, the syntax gets much more complex, as in Scala. Each approach has its pros and cons, and Scala is a fine language too.
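A mixin in Ruby is just a module pulled into a class with include. A minimal sketch (the module and class names here are purely illustrative):

module Downloadable
  def download
    puts "downloading #{url}"
  end
end

class Page
  include Downloadable   # mix the module's methods into the class
  attr_reader :url
  def initialize(url)
    @url = url
  end
end

Page.new("http://www.cnbeta.com/").download   # prints: downloading http://www.cnbeta.com/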
Last time, when I learned Lua, it took me two days to learn the language and another two to finish an exercise, drawing the graph of an arbitrary inequality in two variables. Ruby is much more complex than Lua, and learning it properly will probably take at least two weeks. After four or five days of solid reading, I now spend part of each day reading and part of it on an exercise. So what should the exercise be this time? An offline-browsing proxy server.
Long ago I used offline browsing for two purposes. One was speed: the network was slow, so downloading pages ahead of time made browsing fast. The other was making e-books: download the pages to files, edit them a little, and compile them into a CHM e-book. Now I have two new reasons. First, to preserve time-limited content: for example, cnBeta stops showing an article's comments 24 hours after publication, but a locally saved copy is not subject to that limit. Second, to build in a search engine and look up the content I need. Based on these two goals, the plan is to download web pages into a database and run a proxy server: when the requested URL already exists in the database, the content is served from there; otherwise it is fetched from the web. Here is my implementation:
# webclient.rb
require 'net/http'

module Proxies
  # Fiddler's default listening port (8888) is assumed here.
  Fiddler = Net::HTTP::Proxy('localhost', 8888)
end

class WebClient
  def initialize(proxy = nil, user_agent = 'WebSpider')
    @http = if proxy then proxy else Net::HTTP end
    @user_agent = user_agent
  end

  def visit(url, params = nil, rest = nil)
    uri = URI.parse(URI.encode(url))
    uri.query = URI.encode_www_form(params) if params
    puts "Get #{uri}"
    @http.start(uri.host, uri.port) do |http|
      request = Net::HTTP::Get.new uri.request_uri
      request["User-Agent"] = @user_agent
      response = http.request request

      puts "#{response.code} #{response.message}"
      puts "wait #{sleep rest}" if rest
      response
    end
  end
end
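As a quick illustration, the WebClient above can be used on its own like this (a sketch, not part of the original scripts; it assumes the machine is online):

require_relative 'webclient'

client = WebClient.new                                # or WebClient.new(Proxies::Fiddler)
response = client.visit("http://www.cnbeta.com/", nil, 0.5)
puts response.body[0, 200] if response.is_a?(Net::HTTPOK)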
# website.rb
require_relative 'mongodoc'

class Website
  attr_accessor :domain, :encoding
  def initialize(domain, encoding)
    @domain = domain
    @encoding = encoding
  end
end

class Webpage
  include MongoDoc::Document
  attr_accessor :url, :data
  def initialize(url, data)
    @url = url
    @data = data
  end
end
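The mongodoc.rb file required above is not part of the listing. Judging from how Webpage is used later (webspider.rb calls to_hash before writing the page to the database), a minimal stand-in could look roughly like this (my guess from usage, not the author's actual file):

# Hypothetical minimal MongoDoc::Document, reconstructed from usage only.
module MongoDoc
  module Document
    # Convert the object's instance variables into a Hash with string keys.
    def to_hash
      instance_variables.each_with_object({}) do |var, hash|
        hash[var.to_s.delete('@')] = instance_variable_get(var)
      end
    end
  end
end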
# websites.rb
require 'nokogiri'
require 'date'
require_relative 'webclient'
require_relative 'website'
require_relative 'database'

module Websites
  Cnbeta = Website.new('cnbeta.com', Encoding::GBK)

  def Cnbeta.page_list(client)
    latest = get_latest(client).to_i
    recent = get_recent().to_i
    # No article in the database yet: fall back to the most recent 100 ids
    # (the window size is an arbitrary choice).
    if recent == 0 then recent = latest - 100 end
    for id in recent.upto(latest)
      yield "http://www.cnbeta.com/articles/#{id}.htm"
    end
  end

  def Cnbeta.get_latest(client)
    response = client.visit("http://www.cnbeta.com/")
    if response.is_a? Net::HTTPOK
      text = response.body.force_encoding(encoding)
      doc = Nokogiri::HTML(text)
      url = doc.css("div.newslist>dl>dt>a").first["href"]
      id = url.match(/\/(?<id>\d+)\.htm/)["id"]
    end
  end

  def Cnbeta.get_recent()
    website = WebsitesDb.collection(domain)
    if webpage = website.find(pubDate:{"$gt" => DateTime.now.prev_day.to_time}).sort(pubDate:"asc").next
      id = webpage["url"].match(/\/(?<id>\d+)\.htm/)["id"]
    end
  end

  def Cnbeta.process_page(client, webpage)
    puts "process #{webpage["url"]}"
    id = webpage["url"].match(/\/(?<id>\d+)\.htm/)["id"]
    text = webpage["data"].to_s.force_encoding(encoding)

    begin
      if text =~ /^<meta http-equiv="refresh"/
        return nil
      end
    rescue ArgumentError => error
      puts error.message
    end

    doc = Nokogiri::HTML(text)
    title = doc.css("#news_title").inner_html.encode("utf-8")
    time = doc.css("#news_author > span").inner_html.encode("utf-8").match(/\u53D1\u5E03\u4E8E (?<time>.*)\|/)["time"]

    time = DateTime.parse(time + " +0800")
    if time < DateTime.now.prev_day
      puts "skip"
      return nil
    end

    if g_content = doc.css("#g_content").first
      comments = client.visit("http://www.cnbeta.com/comment/g_content/#{id}.html", nil, 0.2).body.force_encoding("utf-8")
      g_content.inner_html = comments.encode(encoding)
    end

    if normal = doc.css("#normal").first
      comments = client.visit("http://www.cnbeta.com/comment/normal/#{id}.html", nil, 0.2).body.force_encoding("utf-8")
      normal.inner_html = comments.encode(encoding)
    end

    webpage["title"] = title
    webpage["pubDate"] = time.to_time
    if g_content and normal
      webpage["data"] = BSON::Binary.new(doc.to_html)
    else
      puts "Comment on #{id} skipped"
    end
    puts "done"
    return webpage
  end
end
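Similarly, database.rb is referenced but not shown. WebsitesDb is used as a MongoDB database handle (collection, collections, find_one, save, insert), so with the classic 1.x mongo gem it could be as small as this (the connection details and database name are assumptions):

# Hypothetical database.rb, reconstructed from how WebsitesDb is used.
require 'mongo'

WebsitesDb = Mongo::Connection.new('localhost', 27017).db('websites')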
# webspider.rb
require_relative 'websites'

class WebSpider
  def initialize(proxy = nil)
    @client = WebClient.new proxy
  end

  def collect(site)
    website = WebsitesDb.collection(site.domain)
    site.page_list(@client) do |pageUrl|
      if webpage = website.find_one(url:pageUrl)
        if processed = site.process_page(@client, webpage)
          website.save(processed)
        end
      elsif webpage = collect_page(pageUrl)
        webpage.data = BSON::Binary.new(webpage.data)
        webpage = webpage.to_hash
        if processed = site.process_page(@client, webpage)
          website.insert(processed)
        else
          website.insert(webpage)
        end
      end
    end
  end

  def collect_page(pageUrl)
    response = @client.visit(pageUrl, nil, 0.5)
    case response
    when Net::HTTPOK
      webpage = Webpage.new pageUrl, response.body
    else
      return
    end
  end
end

# spider = WebSpider.new Proxies::Fiddler
spider = WebSpider.new
spider.collect Websites::Cnbeta
# encoding: utf-8
# proxyserver.rb
require "webrick"
require "webrick/httpproxy"
require_relative 'database'

class OfflineProxyServer < WEBrick::HTTPProxyServer
  def do_GET(req, res)
    uri = req.request_uri.to_s
    WebsitesDb.collections.each do |site|
      if page = site.find_one(url:uri)
        @logger.info("Found #{uri}")
        res['connection'] = "close"
        res.status = 200
        res.body = page["data"].to_s
        return
      end
    end
    super
  end
end

def search_db(keyword)
  keyword.force_encoding("utf-8")
  site = WebsitesDb.collection("cnbeta.com")
  site.find(title:/#{keyword}/).sort(pubDate:"desc").limit(100).to_a
end

search = WEBrick::HTTPServlet::ProcHandler.new ->(req, resp) do
  keyword = req.query["keyword"].to_s
  result = if keyword.empty? then [] else search_db(keyword) end
  resp['Content-Type'] = "text/html"
  resp.body = %{
    <html>
    <head><meta charset="utf-8"><title>离线搜索</title><style>form{margin-left:30px}li{margin:10px}</style></head>
    <body>
    <form>
      <em>cnBeta</em>
      <input type="text" name="keyword" value="#{keyword}"><input type="submit" value="搜索">
    </form>
    <ul>
#{
  result.map{|page|"<li><a href=#{page['url']} target=_blank>#{page['title']}</a> #{page['pubDate']}</li>"}.join
}
    </ul>
    </body></html>
}
end

# Port 9999 matches the proxy address given in the text below.
server = OfflineProxyServer.new(Port: 9999)
server.mount("/search", search)
Signal.trap(:INT){ server.shutdown }
server.start
Run webspider.rb first, then start proxyserver.rb, and set your browser's proxy to localhost:9999; you can then browse offline any site already saved in the database. Open http://localhost:9999/search to get the offline search page. Here is a screenshot showing what the offline search looks like.
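You can also check the proxy from Ruby instead of a browser. A small sketch (it assumes proxyserver.rb is already running on localhost:9999, and the article id below is made up and must already have been saved by the spider):

require 'net/http'

# Fetch a saved article through the offline proxy and report the response size.
proxy = Net::HTTP::Proxy('localhost', 9999)
proxy.start('www.cnbeta.com') do |http|
  response = http.get('/articles/123456.htm')   # example id only
  puts "#{response.code} #{response.message}, #{response.body.size} bytes"
end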
If you want to run this program yourself, the steps are roughly as follows: install Ruby 1.9 and MongoDB, install the gems the scripts require (nokogiri and the MongoDB driver), run webspider.rb to fill the database, then start proxyserver.rb and point your browser's proxy at localhost:9999.
C focuses on low-level capabilities, Ruby on high-level ones; Ruby deserves to be a programmer's everyday language. This exercise also shows how short, readable, and powerful Ruby code is: the whole program adds up to only 231 lines, which speaks for itself. Of course, this is only a first version and certainly has shortcomings; corrections from more experienced readers are welcome. You can also build new features on top of it, such as ad filtering or full-text search. Use your imagination!
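For example, the ad-filtering idea could hook into Cnbeta.process_page and strip unwanted elements from the Nokogiri document before the page data is rebuilt. A hypothetical sketch (the CSS selectors are placeholders, not taken from the real cnBeta markup):

# Hypothetical ad filter; call it on doc inside process_page before doc.to_html.
def strip_ads(doc)
  doc.css('.advert, .sponsor, iframe').remove   # placeholder selectors
  doc
end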