Storing Python Crawler Data in a Database (Importing Python Crawler Data into a Database)

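A Python crawler is an automated tool for collecting data from the internet. The scraped data is usually written to a database so that it can be processed and analyzed further. Storing the data is the last step of a crawl, and also the most critical one. This article walks through the whole process: data cleaning, data storage, and data reading.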

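I. Cleaning crawler data

Scraped data typically raises the following problems:

1. The data is inconsistently formatted and needs to be cleaned.

2. There is too much data, so the valuable records need to be selected.

3. The data contains a lot of junk, which needs to be filtered out.

Data cleaning addresses these problems. Its main task is to give the raw scraped data an initial pass of processing and tidying so that the storage code downstream can handle it correctly. It covers the following: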

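1. String handling

Scraped strings usually need tidying, for example stripping spaces and line breaks, or making sure Chinese text is decoded to Unicode correctly.

2. Type conversion

Before the data is written to the database, each value has to be converted to the appropriate type, for example strings to numbers or dates.

3. Data selection

Scraped data is rich in information, but not all of it is valuable. Filter the data and keep only the higher-quality records.

4. Junk filtering

Scraped data often contains a lot of junk, such as advertisements and internet slang. It contributes nothing to analysis or processing, so it should be filtered out.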
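The four cleaning steps above can be sketched in a single function. This is a minimal illustration, not a general-purpose cleaner: the field names (`title`, `rating`, `published`), the date format, and the junk keyword list are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical junk markers; a real crawler would tune this list to its data.
JUNK_KEYWORDS = ("advertisement", "click here")


def clean_record(raw):
    """Return a cleaned copy of one scraped record, or None to discard it."""
    # 1. String handling: collapse runs of whitespace (spaces, newlines, tabs).
    title = re.sub(r"\s+", " ", raw["title"]).strip()

    # 2. Type conversion: text to float and to date.
    try:
        rating = float(raw["rating"])
        published = datetime.strptime(raw["published"].strip(), "%Y-%m-%d").date()
    except ValueError:
        return None  # malformed number or date

    # 3. Selection: keep only records that carry a title.
    if not title:
        return None

    # 4. Junk filtering: drop records whose title contains a junk marker.
    if any(word in title.lower() for word in JUNK_KEYWORDS):
        return None

    return {"title": title, "rating": rating, "published": published}
```

Returning `None` for rejected records lets the caller filter a whole batch with one list comprehension before anything reaches the database.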

II. Storing crawler data

Storing scraped data in a database involves choosing a database type, designing the table schema, and writing the code that talks to the database. The steps are:

1. Choose a database type

Pick a database that suits the application. Common choices include MySQL, Oracle, SQL Server, and MongoDB.

2. Design the table schema

The tables have to exist before any scraped data can be stored. Design the schema around the types of the data to be stored.

3. Write the database access code

Before storing anything, write the program that operates on the database. This code is what actually gets the scraped data into the database, so it must be correct and reliable.

4. Store the data

With the preparation done, the scraped data can be written to the database. There are several ways to do this:

(1) Write the data with SQL statements.

(2) Write the data through an ORM framework.

(3) Write the data to a NoSQL database through its own client API.
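As an illustration of option (1), here is a minimal sketch that creates a table and writes records with parameterized SQL. It uses Python's built-in `sqlite3` module as a stand-in so that it runs without a database server. The table name `books` and its columns are assumptions for the example. Other drivers follow the same pattern but may use a different placeholder style (pymysql, for instance, uses `%s` rather than `?`).

```python
import sqlite3


def store_books(conn, books):
    """Create the table if needed and insert each book with a parameterized query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "title TEXT NOT NULL, rating REAL, price REAL)"
    )
    for book in books:
        # Placeholders let the driver escape the values, so scraped text
        # containing quotes cannot break the statement.
        conn.execute(
            "INSERT INTO books (title, rating, price) VALUES (?, ?, ?)",
            (book["title"], book["rating"], book["price"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, discarded on close
store_books(conn, [{"title": "Example Book", "rating": 8.5, "price": 39.0}])
stored = conn.execute("SELECT title, rating, price FROM books").fetchall()
```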

III. Reading crawler data

Once the scraped data is in the database, it needs to be read back and processed. There are several ways to do this:

1. Read the data with SQL statements, then process it in Python.

2. Read and process the data through an ORM framework.

3. Read and process the data directly through a NoSQL database.

Whichever approach is chosen, the data read back must be correct and reliable, and the valuable data must be quick to retrieve.
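The first approach, reading with SQL and then processing in Python, might look like the sketch below. It again uses the standard-library `sqlite3` module as a stand-in, and the table contents are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Book A", 9.1), ("Book B", 7.4), ("Book C", 8.6)],
)

# Let the database do the filtering and sorting ...
rows = conn.execute(
    "SELECT title, rating FROM books WHERE rating >= ? ORDER BY rating DESC",
    (8.0,),
).fetchall()

# ... then do the remaining processing in Python.
average = sum(rating for _, rating in rows) / len(rows)
top_titles = [title for title, _ in rows]
```

Pushing the `WHERE` and `ORDER BY` into the query keeps the amount of data transferred small, which matters once the table holds many scraped rows.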

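This article has covered how crawler data is stored in a database, including data cleaning, data storage, and data reading. Storing the data is the last step of a crawl and matters a great deal for later analysis and processing. When writing a crawler, pay attention to cleaning and storing the data, and choose a reading approach that fits the processing you need.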


How to scrape Douban Books data with Python

We use bs4 to parse the fields we need, such as publication date, author/translator, Douban rating, price, and number of ratings.

    from bs4 import BeautifulSoup

    # Parse one results page under a single tag
    def parse_tag_page(html):
        try:
            soup = BeautifulSoup(html, "lxml")
            # select_one() returns a single element, while select() returns a list
            tag_name = soup.select_one('title').get_text().strip()
            list_soup = soup.find('ul', {'class': 'subject-list'})
            if list_soup is None:
                print('Failed to get the book list')
            else:
                for book_info in list_soup.find_all('div', {'class': 'info'}):
                    # Title
                    title = book_info.find('a').get('title').strip()
                    # Number of ratings
                    people_num = book_info.find('span', {'class': 'pl'}).get_text().strip()
                    # Publication info and author
                    pub = book_info.find('div', {'class': 'pub'}).get_text().strip()
                    pub_list = pub.split('/')
                    # Assumption: the last three fields are publisher, date and price,
                    # and everything before them is author/translator
                    try:
                        author_info = 'Author/translator: ' + '/'.join(pub_list[0:-3])
                    except Exception:
                        author_info = 'Author/translator: none'
                    try:
                        pub_info = 'Publication info: ' + '/'.join(pub_list[-3:-1])
                    except Exception:
                        pub_info = 'Publication info: none'
                    try:
                        price_info = 'Price: ' + '/'.join(pub_list[-1:])
                    except Exception:
                        price_info = 'Price: none'
                    try:
                        rating_num = book_info.find('span', {'class': 'rating_nums'}).get_text().strip()
                    except Exception:
                        rating_num = '0.0'
                    book_data = {
                        'title': title,
                        'people_num': people_num,
                        'author_info': author_info,
                        'pub_info': pub_info,
                        'price_info': price_info,
                        'rating_num': rating_num
                    }
                    # return book_data
                    if book_data:
                        # save_to_mongo() is defined elsewhere and not shown here
                        save_to_mongo(book_data, tag_name)
        except Exception:
            print('Parse error')
            return None
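The publication line is the fiddly part of the parser above, so it may help to see the splitting in isolation. The sketch below assumes the line has the shape "author / [translator /] publisher / date / price", with the last three fields always present. That layout is an assumption about the page, and `split_pub_info` is a hypothetical helper, not part of the original code.

```python
def split_pub_info(pub):
    """Split 'author / [translator /] publisher / date / price' into named fields."""
    parts = [p.strip() for p in pub.split("/")]
    if len(parts) < 4:
        return None  # too few fields to interpret safely
    return {
        "authors": parts[:-3],   # everything before the last three fields
        "publisher": parts[-3],
        "pub_date": parts[-2],
        "price": parts[-1],
    }
```

Counting fields from the end means the same code handles entries with and without a translator, since only the number of leading name fields changes.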

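Over the past two days I scraped roughly 100,000 book records from Douban Books, which took nearly a full day. While I have some free time, here is a summary of the code. I'm still a beginner and used the simplest, most brute-force approach throughout, so corrections from more experienced readers are welcome.

Step one, the libraries we need: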

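    import requests  # fetch web pages
    from bs4 import BeautifulSoup  # parse web pages
    import time  # add delays so that overly frequent requests don't get the IP banned
    import re  # regular expressions
    import pymysql  # there is a lot of data, so we store it in MySQL; this library connects to the database
    import random  # randint() generates random numbers; combined with time, it randomizes the delay between requests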

This is the Douban URL: x-sorttags-all

From this page we collect the links for every category tag, then crawl the listings behind them. Here is the code:

    import requests
    from bs4 import BeautifulSoup  # import the libraries

    url = "httom/tag/?icn=index-nav"
    wb_data = requests.get(url)  # request the page
    soup = BeautifulSoup(wb_data.text, "lxml")  # parse the page
    tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
    # Find the tag links by CSS path. To get the path: right-click > Inspect > Copy selector.
    # tags is a list.
    for tag in tags:
        tag = tag.get_text()  # extract the text of each tag
        helf = "hom/tag/"
        # Douban tag URLs are basically this prefix plus the tag name,
        # so we assemble the URL of each tag's listing page
        url = helf + str(tag)
        print(url)  # print the assembled URL
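Tag names on the site are often Chinese, so one optional refinement is to percent-encode the tag before appending it to the prefix. The sketch below is only an illustration: `BASE_URL` is a placeholder, not the address used in the article, and `build_tag_url` is a hypothetical helper.

```python
from urllib.parse import quote

# Placeholder base URL for illustration only.
BASE_URL = "https://example.com/tag/"


def build_tag_url(tag_name, base_url=BASE_URL):
    """Return the listing URL for a tag, percent-encoding non-ASCII characters."""
    return base_url + quote(tag_name.strip())
```

For example, `build_tag_url("科技")` yields `https://example.com/tag/%E7%A7%91%E6%8A%80`. Libraries such as requests normally encode URLs themselves, so this mainly helps when the URLs are stored or logged.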

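That gives us the URLs for every tag. We save this file as channel and, inside it, create a string named channel that holds all the URLs we collected. When crawling the listing pages later, we read the links straight from there:

    channel = '''
    tag/程序
    '''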
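To show how such a string is consumed later, here is a small sketch that splits a multi-line string into tag URLs and expands each one into page URLs. The example URLs, the `?start=` query parameter, and the page size of 20 are all hypothetical; the article does not show how its page links are built.

```python
# Hypothetical stand-in for the channel string; the real one holds the scraped tag URLs.
channel = '''
https://example.com/tag/a
https://example.com/tag/b
'''


def page_urls(tag_url, pages, page_size=20):
    """Yield one URL per results page, using a hypothetical ?start= offset parameter."""
    for page in range(pages):
        yield f"{tag_url}?start={page * page_size}"


tag_urls = channel.split()  # split() with no argument drops blank lines and whitespace
all_pages = [u for tag_url in tag_urls for u in page_urls(tag_url, pages=2)]
```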

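Now for the second program.

Every book entry on a tag page looks essentially the same. From each entry we can directly extract the title, author, publisher, publication date, price, number of ratings, and rating (some foreign works also list a translator). The extraction works the same way as for the tags: by CSS path.

Let's first test the crawl on a single URL: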

    url = "htt/tag/科技"
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text.encode("utf-8"), "lxml")
    tag = url.split("?")[0].split("/")[-1]  # take the tag name from the URL so it can be stored with each record
    detils = soup.select("#subject_list > ul > li > div.info > div.pub")  # author and publisher line; separated later with split()
    scors = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.rating_nums")  # rating
    persons = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.pl")  # number of ratings
    titles = soup.select("#subject_list > ul > li > div.info > h2 > a")  # title
    # Everything selected above is still an HTML element; the fields have to be separated one by one
    for detil, scor, person, title in zip(detils, scors, persons, titles):
        # zip() walks all four lists in a single loop
        # Some entries list a translator and some don't, so a try block handles the two cases separately
        try:
            # With a translator: split on "/" into five parts and take them in order
            parts = detil.get_text().split("/", 4)
            author = parts[0].split()[0]
            yizhe = parts[1]
            publish = parts[2]
            # Keep only the publication year (named pub_year so it does not shadow the time module)
            pub_year = parts[3].split()[0].split("-")[0]
            price = ceshi_priceone(detil)  # price units vary, so a helper converts them to yuan
            scoe = scor.get_text().strip()  # some books have no rating; those are stored as an empty string
            person = ceshi_person(person)  # entries with fewer than ten ratings need special handling, done in a helper
            title = title.get_text().split()[0]
        # Without a translator an IndexError is raised; handle that case separately
        except IndexError:
            try:
                # Split into four parts; the translator is left empty, the rest is as above
                parts = detil.get_text().split("/", 3)
                author = parts[0].split()[0]
                yizhe = ""
                publish = parts[1]
                pub_year = parts[2].split()[0].split("-")[0]
                price = ceshi_pricetwo(detil)
                scoe = scor.get_text().strip()
                person = ceshi_person(person)
                title = title.get_text().split()[0]
            except (IndexError, TypeError):
                continue
        # On any other error, skip the entry and carry on (a few entries lack a publisher or
        # publication year, but they are rare and don't affect a large crawl)
        except TypeError:
            continue

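    # Extract the number of ratings; if fewer than ten people rated the book, count it as ten
    def ceshi_person(person):
        try:
            text = person.get_text().split()[0]
            # Assumption: the number sits between a leading "(" and four trailing characters
            person = int(text[1:len(text) - 4])
        except ValueError:
            person = 10
        return person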
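Slicing by fixed character positions is fragile if the surrounding text ever changes. One alternative, sketched here under the assumption that the count is the first run of digits in the text, is a regular expression. `parse_rating_count` is a hypothetical helper, and the sample strings in the note below are invented.

```python
import re


def parse_rating_count(text, default=10):
    """Return the first integer found in text, or default when there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default
```

With this helper, `parse_rating_count("(1234 ratings)")` gives 1234, while text without digits falls back to the default of ten, matching the rule described above.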

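    # Extract the price case by case: find the currency marker with a regular expression and convert to yuan.
    # Assumption: prices look like "USD 12.50", "CNY 39.00", "$12.50" or a bare number.
    def ceshi_priceone(detil):
        price = detil.get_text().split("/", 4)[4].split()
        if re.match("USD", price[0]):
            price = float(price[1]) * 6
        elif re.match("CNY", price[0]):
            price = price[1]
        elif re.match(r"\$", price[0]):
            price = float(price[0][1:]) * 6
        else:
            price = price[0]
        return price

    # Same as above, for entries without a translator (four fields instead of five)
    def ceshi_pricetwo(detil):
        price = detil.get_text().split("/", 3)[3].split()
        if re.match("USD", price[0]):
            price = float(price[1]) * 6
        elif re.match("CNY", price[0]):
            price = price[1]
        elif re.match(r"\$", price[0]):
            price = float(price[0][1:]) * 6
        else:
            price = price[0]
        return price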
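The two price helpers differ only in which field they read, so the conversion itself can be factored out. The sketch below is one possible way to do that, not the author's code: `price_in_yuan` is a hypothetical helper, the accepted formats are assumptions, and the fixed rate of 6 yuan per US dollar is simply carried over from the functions above.

```python
import re

# Rough exchange rate taken from the code above (1 USD = 6 yuan); real rates vary.
USD_TO_CNY = 6.0


def price_in_yuan(raw):
    """Convert a price string such as 'USD 12.50', '$12.50' or '39.00元' to yuan."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return None  # no number found
    amount = float(match.group())
    if raw.strip().upper().startswith("USD") or "$" in raw:
        return amount * USD_TO_CNY
    return amount  # treated as already being in yuan
```

Returning a float (or `None`) in every case also avoids the mix of strings and numbers that the original helpers produce.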

Once the test works, we can crawl the data and load it into the database. The complete source follows, with comments on the special cases.

    import requests
    from bs4 import BeautifulSoup
    import time
    import re
    import pymysql
    from channel import channel  # the links collected by the first program
    import random

    def ceshi_person(person):
        try:
            text = person.get_text().split()[0]
            # Assumption: the number sits between a leading "(" and four trailing characters
            person = int(text[1:len(text) - 4])
        except ValueError:
            person = 10
        return person

    def ceshi_priceone(detil):
        price = detil.get_text().split("/", 4)[4].split()
        if re.match("USD", price[0]):
            price = float(price[1]) * 6
        elif re.match("CNY", price[0]):
            price = price[1]
        elif re.match(r"\$", price[0]):
            price = float(price[0][1:]) * 6
        else:
            price = price[0]
        return price

    def ceshi_pricetwo(detil):
        price = detil.get_text().split("/", 3)[3].split()
        if re.match("USD", price[0]):
            price = float(price[1]) * 6
        elif re.match("CNY", price[0]):
            price = price[1]
        elif re.match(r"\$", price[0]):
            price = float(price[0][1:]) * 6
        else:
            price = price[0]
        return price

    # The test code from above, now wrapped in the main function
    def mains(url):
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text.encode("utf-8"), "lxml")
        tag = url.split("?")[0].split("/")[-1]
        detils = soup.select("#subject_list > ul > li > div.info > div.pub")
        scors = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.rating_nums")
        persons = soup.select("#subject_list > ul > li > div.info > div.star.clearfix > span.pl")
        titles = soup.select("#subject_list > ul > li > div.info > h2 > a")
        for detil, scor, person, title in zip(detils, scors, persons, titles):
            l = []  # list that holds the row to insert
            try:
                parts = detil.get_text().split("/", 4)
                author = parts[0].split()[0]
                yizhe = parts[1]
                publish = parts[2]
                pub_year = parts[3].split()[0].split("-")[0]
                price = ceshi_priceone(detil)
                scoe = scor.get_text().strip()
                person = ceshi_person(person)
                title = title.get_text().split()[0]
            except IndexError:
                try:
                    parts = detil.get_text().split("/", 3)
                    author = parts[0].split()[0]
                    yizhe = ""
                    publish = parts[1]
                    pub_year = parts[2].split()[0].split("-")[0]
                    price = ceshi_pricetwo(detil)
                    scoe = scor.get_text().strip()
                    person = ceshi_person(person)
                    title = title.get_text().split()[0]
                except (IndexError, TypeError):
                    continue
            except TypeError:
                continue
            # Put the scraped fields into the list in the table's column order
            l.append([title, scoe, author, price, pub_year, publish, person, yizhe, tag])
            sql = "INSERT INTO allbooks values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"  # SQL insert statement
            cur.executemany(sql, l)  # executemany() inserts the rows into the database in batch
            conn.commit()
    # End of the main function
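The function above calls `executemany()` with a single row and commits once per book. A variation worth considering is to collect rows first and insert them in larger batches. The sketch below shows the idea with the standard-library `sqlite3` module as a stand-in. The three-column table, the batch size, and the `insert_in_batches` helper are assumptions for illustration, and with pymysql the placeholders would be `%s` instead of `?`.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=500):
    """Insert rows with executemany(), committing once per batch."""
    sql = "INSERT INTO allbooks (title, scor, author) VALUES (?, ?, ?)"
    inserted = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany(sql, batch)
        conn.commit()
        inserted += len(batch)
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE allbooks (title TEXT NOT NULL, scor TEXT, author TEXT)")
rows = [(f"Book {i}", "8.0", "Someone") for i in range(1200)]
total = insert_in_batches(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM allbooks").fetchone()[0]
```

Fewer commits usually means less overhead per row. The trade-off is that a crash can lose the rows of the batch in progress.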

    # Connect Python to the MySQL database named python
    conn = pymysql.connect(user="root", password="123123", database="python", charset='utf8')
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS allbooks')  # drop the allbooks table if it already exists
    sql = """CREATE TABLE allbooks(
    title CHAR(255) NOT NULL,
    scor CHAR(255),
    author CHAR(255),
    price CHAR(255),
    time CHAR(255),
    publish CHAR(255),
    person CHAR(255),
    yizhe CHAR(255),
    tag CHAR(255)
    )"""
    cur.execute(sql)  # run the statement to create the allbooks table
    start = time.perf_counter()  # start a timer so we know how long the crawl took (time.clock() was removed in Python 3.8)
    for urls in channel.split():
        urlss= # take each URL from channel and build the link for every page
        for url in urlss:
            mains(url)  # run the main function and start crawling
            print(url)  # print the link being crawled so we know how far we got, which also helps when an error occurs
            time.sleep(random.randint(0, 9))  # pause for a random time after each page so the IP doesn't get banned
    end = time.perf_counter()
    print('Time Usage:', end - start)  # crawl finished; print the elapsed time
    count = cur.execute('select * from allbooks')
    print('has %s record' % count)  # print the total number of records scraped
    # Release the database connection
    if cur:
        cur.close()
    if conn:
        conn.close()
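The timing and throttling pattern used in the crawl loop can be isolated as follows. This is a minimal sketch: `polite_sleep` and its bounds are hypothetical, and the tiny delay values only keep the demonstration fast. `time.perf_counter()` is the standard replacement for `time.clock()`, which no longer exists in Python 3.8 and later.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


start = time.perf_counter()          # monotonic, high-resolution timer
waited = polite_sleep(0.01, 0.02)    # tiny values here just to keep the demo fast
elapsed = time.perf_counter() - start
```

Using a non-zero lower bound guarantees some pause between requests, whereas `random.randint(0, 9)` can return 0 and send two requests back to back.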

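And with that, the program is done: Douban's book records get written into our database one by one. I did run into plenty of problems along the way. For example, the title text contains spaces once it is split, which caused errors when writing to the database, so I kept only the first part of each title. As a result, some titles in the database are incomplete. If anyone passing by knows a fix, I'd appreciate the pointer.

Waiting for a crawl is long but satisfying. Watching the records scroll past on the screen, a sense of achievement creeps up on you. But when it's still crawling while you eat, still crawling while you're in the bathroom, and still crawling after you've come back from climbing a mountain, it starts to wear on you, and you worry the computer might die at any moment (I'm still a broke student and can't afford a new one!)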

How to operate the MySQL database in XAMPP from Python

Hello,

    import MySQLdb

    try:
        conn = MySQLdb.connect(host='localhost', user='root', passwd='root', port=3306)
        cur = conn.cursor()
        conn.select_db('python')
        count = cur.execute('select * from test')
        print('there has %s rows record' % count)
        result = cur.fetchone()
        print(result)
        print('ID: %s info %s' % result)
        results = cur.fetchmany(5)
        for r in results:
            print(r)
        print('==' * 10)
        cur.scroll(0, mode='absolute')
        results = cur.fetchall()
        for r in results:
            print(r)
        conn.commit()
        cur.close()
        conn.close()
    except MySQLdb.Error as e:
        print("Mysql Error %d: %s" % (e.args[0], e.args[1]))
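The three fetch methods used above are part of Python's standard database API, so their behavior can be tried without a MySQL server. The sketch below uses the built-in `sqlite3` module as a stand-in with made-up rows. Note that `sqlite3` cursors have no `scroll()` method, so to start over there you would re-run the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, info TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 9)])

cur = conn.cursor()
cur.execute("SELECT id, info FROM test ORDER BY id")
first = cur.fetchone()        # the first row
next_five = cur.fetchmany(5)  # the five rows after it
rest = cur.fetchall()         # whatever is left
cur.close()
conn.close()
```

Each call continues from where the previous one stopped, which is why the original code scrolls back to position 0 before calling `fetchall()`.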


Title: Storing Python Crawler Data in a Database (Importing Python Crawler Data into a Database)
URL: http://uogjgqi.cn/article/cocihpe.html
