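How should files in tsv, csv, txt, and json formats be processed? Many people without much experience get stuck on this, so this article summarizes the common causes of trouble and how to resolve them.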
Data in tsv, csv, txt, or json format can generally be loaded with TabularDataset from torchtext.
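Data requirements:
tsv: the first line holds the field names, separated by tabs; every other line is a record whose field values are separated by tabs.
csv: the first line holds the field names; every other line is a record.
json: one dictionary (JSON object) per line; the keys are the field names and the values are the data.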
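To make the three layouts concrete, here is a minimal sketch that writes the same two records as tsv, csv, and line-delimited json using only the standard library. The sample rows and file names are hypothetical; only the field names are taken from the movie-review data used later in this article.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample records; field names mirror the movie-review dataset.
rows = [
    {"PhraseId": "1", "SentenceId": "1", "Phrase": "a good movie", "Sentiment": "3"},
    {"PhraseId": "2", "SentenceId": "1", "Phrase": "good", "Sentiment": "3"},
]
field_names = list(rows[0].keys())


def write_delimited(path, delimiter):
    """Write a header line with the field names, then one record per line."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=field_names, delimiter=delimiter)
        writer.writeheader()
        writer.writerows(rows)


def write_json_lines(path):
    """Write one JSON object per line; keys are field names, values are data."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


out_dir = tempfile.mkdtemp()
tsv_path = os.path.join(out_dir, "sample.tsv")
csv_path = os.path.join(out_dir, "sample.csv")
json_path = os.path.join(out_dir, "sample.json")

write_delimited(tsv_path, "\t")
write_delimited(csv_path, ",")
write_json_lines(json_path)

with open(tsv_path, encoding="utf-8") as f:
    tsv_lines = f.read().splitlines()
with open(csv_path, encoding="utf-8") as f:
    csv_lines = f.read().splitlines()
with open(json_path, encoding="utf-8") as f:
    json_lines = f.read().splitlines()

print(tsv_lines[0])
print(csv_lines[0])
print(json_lines[0])
```

Files written this way have the header row that the tsv and csv layouts describe, and one dictionary per line for json.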
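This walkthrough uses the following tsv dataset:
sentiment-analysis-on-movie-reviews.zip
Dataset format: the training file has four tab-separated columns: PhraseId, SentenceId, Phrase, and Sentiment.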
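Note: if the test set lacks some of the fields, torchtext will run into problems when processing it, so make sure the fields to be processed are identical across the train, val, and test sets.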
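One way to catch this early is to compare the header of each file against the fields you intend to process before handing anything to torchtext. The sketch below is a standalone check using the standard library; the helper names and the in-memory sample files are hypothetical.

```python
import csv
import io


def read_header(stream, delimiter="\t"):
    """Return the field names found on the first line of a delimited file."""
    return next(csv.reader(stream, delimiter=delimiter))


def missing_fields(required, header):
    """List the required field names that the header does not contain."""
    return [name for name in required if name not in header]


# In-memory stand-ins for real files (hypothetical contents).
train_file = io.StringIO("PhraseId\tSentenceId\tPhrase\tSentiment\n1\t1\ta good movie\t3\n")
test_file = io.StringIO("PhraseId\tSentenceId\tPhrase\n9\t5\tnot bad\n")

required = ["Phrase", "Sentiment"]
train_missing = missing_fields(required, read_header(train_file))
test_missing = missing_fields(required, read_header(test_file))

print(train_missing)  # []
print(test_missing)   # ['Sentiment']
```

If a split reports missing fields, either add the column to that file or drop the field from what you process.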
Task: build a translation-style dataset
inputs: [sequence english] target: [sequence chinese]
# Field, TabularDataset and BucketIterator belong to the legacy torchtext API
# (torchtext.data up to 0.8; torchtext.legacy.data in 0.9 to 0.11).
from torchtext.data import Field, TabularDataset, BucketIterator
import torch

batch_size = 6
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenize_x = lambda x: x.split()
tokenize_y = lambda y: y

# Assumption: the init/eos token strings were lost from the source text;
# '<sos>' and '<eos>' are stand-ins.
TEXT = Field(sequential=True, use_vocab=True, tokenize=tokenize_x, lower=True,
             batch_first=True, init_token='<sos>', eos_token='<eos>')
LABEL = Field(sequential=False, use_vocab=False, tokenize=tokenize_y,
              batch_first=True, init_token=None, eos_token=None)

# fields = {'english': ('en', ENGLISH), 'chinese': ('cn', CHINESE)}
# The first element of each tuple is the field name in the tsv file
fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)]

train_data, test_data = TabularDataset.splits(path='data',
                                              train='movie-sentiment_train.tsv',
                                              test='movie-sentiment_test.tsv',
                                              format='tsv',
                                              skip_header=True,
                                              fields=fields)

TEXT.build_vocab(train_data, max_size=10000, min_freq=2)
VOCAB_SIZE = len(TEXT.vocab)

# Vocabulary operations
print("vocabulary size: ", VOCAB_SIZE)
print(TEXT.vocab.freqs)
print(TEXT.vocab.itos[:10])
for i, v in enumerate(TEXT.vocab.stoi):
    if i == 10:
        break
    print(v)
print(TEXT.vocab.stoi['apple'])
print(f'{TEXT.init_token} index is {TEXT.vocab.stoi[TEXT.init_token]}')
print(f'{TEXT.eos_token} index is {TEXT.vocab.stoi[TEXT.eos_token]}')

UNK_STR = TEXT.unk_token
PAD_STR = TEXT.pad_token
UNK_IDX = TEXT.vocab.stoi[UNK_STR]
PAD_IDX = TEXT.vocab.stoi[PAD_STR]
print(f'{UNK_STR} index is {UNK_IDX}')
print(f'{PAD_STR} index is {PAD_IDX}')

# Dataset operations
print(len(train_data))
print(train_data[0].__dict__.keys())
print(train_data[0].__dict__.values())
# vars returns the attributes of an object
print(vars(train_data.examples[0]))
print(train_data[0].Phrase)
print(train_data[0].Sentiment)

# batch_sizes: Tuple of batch sizes to use for the different splits,
# or None to use the same batch_size for all splits.
train_iterator, test_iterator = BucketIterator.splits((train_data, test_data),
                                                      batch_size=32,
                                                      batch_sizes=None,
                                                      device=device,
                                                      repeat=False,
                                                      # shuffle=True,
                                                      sort_key=lambda x: len(x.Phrase),
                                                      sort=False,
                                                      sort_within_batch=True)

for batch in train_iterator:
    print(batch.Phrase.shape)
    print([TEXT.vocab.itos[idx] for idx in batch.Phrase[0]])
    print(batch.Sentiment)
    break
If there is only a single text file to process, drop the splits method and adjust the initialization arguments as follows:
fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)]

train_data = TabularDataset(path='data/movie-sentiment_train.tsv',
                            format='tsv',
                            skip_header=True,
                            fields=fields)

train_iterator = BucketIterator(train_data,
                                batch_size=batch_size,
                                device=device,
                                shuffle=False,
                                repeat=False,
                                sort_key=lambda x: len(x.Phrase),
                                sort_within_batch=False)
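Whether a field needs use_vocab=True, that is, whether a vocabulary must be built:
Input text always needs a vocabulary. For labels, if they are numeric (they are actually stored as strings in the file), the iterator typically converts them with int() into a LongTensor, so no vocabulary is needed. If the labels are not numeric, a vocabulary must be built so that the iterator can map them to indices and return a LongTensor.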
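The two label cases can be illustrated in plain Python. This is only a sketch of the idea, not torchtext's implementation; the function name encode_labels is hypothetical. Numeric strings are converted directly with int(), while any other labels get a small vocabulary that maps each distinct label to an index.

```python
def encode_labels(labels):
    """Return (indices, vocab); vocab is None when every label is a numeric string."""
    try:
        # Numeric labels: convert directly, no vocabulary needed.
        return [int(label) for label in labels], None
    except ValueError:
        pass
    # Non-numeric labels: build a vocabulary in order of first appearance.
    vocab = {}
    for label in labels:
        vocab.setdefault(label, len(vocab))
    return [vocab[label] for label in labels], vocab


numeric_ids, numeric_vocab = encode_labels(["3", "0", "4"])
text_ids, text_vocab = encode_labels(["pos", "neg", "pos"])

print(numeric_ids, numeric_vocab)  # [3, 0, 4] None
print(text_ids, text_vocab)        # [0, 1, 0] {'pos': 0, 'neg': 1}
```

In torchtext terms, the first case corresponds to a label field with use_vocab=False, and the second to use_vocab=True followed by a build_vocab call on that field.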
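Passing a list versus a dict as the fields argument of TabularDataset:
list
The fields must be declared in the same order as the columns in the dataset. Advantage: the first line of the dataset does not have to contain field names. Disadvantage: the train, test, and val files must all have exactly the same columns.
Set skip_header in TabularDataset to True or False depending on whether the first line of the dataset contains field names.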
fields = [("PhraseId", None), ("SentenceId", None), ("Phrase", TEXT), ("Sentiment", LABEL)]
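dict
Fields can be picked selectively as needed. Advantage: the train, test, and val files do not have to share exactly the same columns. Disadvantage: the first line of the dataset must contain the field names.
skip_header in TabularDataset must be False.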
fields = {'Phrase': ('Phrase', TEXT), 'Sentiment': ('Sentiment', LABEL)}
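The difference comes down to positional versus name-based column lookup. The following sketch mimics both behaviours with the csv module; it is an illustration under that assumption, not torchtext code, and the helper names select_by_position and select_by_name are hypothetical.

```python
import csv
import io


def select_by_position(stream, fields, skip_header):
    """List style: fields is [(name, keep), ...] in column order."""
    reader = csv.reader(stream, delimiter="\t")
    if skip_header:
        next(reader)  # discard the header line if the file has one
    return [
        {name: value for (name, keep), value in zip(fields, row) if keep}
        for row in reader
    ]


def select_by_name(stream, fields):
    """Dict style: fields maps column name -> attribute name; a header is required."""
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)  # the header is consumed to locate each column
    index = {column: header.index(column) for column in fields}
    return [
        {attr: row[index[column]] for column, attr in fields.items()}
        for row in reader
    ]


data = "PhraseId\tSentenceId\tPhrase\tSentiment\n1\t1\ta good movie\t3\n"

list_fields = [("PhraseId", False), ("SentenceId", False), ("Phrase", True), ("Sentiment", True)]
by_position = select_by_position(io.StringIO(data), list_fields, skip_header=True)

dict_fields = {"Phrase": "text", "Sentiment": "label"}
by_name = select_by_name(io.StringIO(data), dict_fields)

print(by_position)  # [{'Phrase': 'a good movie', 'Sentiment': '3'}]
print(by_name)      # [{'text': 'a good movie', 'label': '3'}]
```

The positional version breaks as soon as a file has a different column order or count, while the name-based version only needs the requested columns to exist in the header.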
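sort and shuffle in BucketIterator:
shuffle: whether to shuffle the examples between epochs. The default is recommended: the train set is shuffled and the other sets are not.
sort_key=lambda x: len(x.Phrase): the key used for sorting.
sort: sort the whole dataset by sort_key (ascending order); False is recommended.
sort_within_batch: sort the examples inside each batch by sort_key in descending order, which is the order pack_padded_sequence expects by default; True is recommended.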
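The interaction of shuffle and sort_within_batch can be sketched in plain Python. This is a simplified illustration: the real BucketIterator additionally groups examples of similar length before batching, which is omitted here, and the function make_batches and its seed parameter are hypothetical.

```python
import random


def make_batches(examples, batch_size, sort_key, shuffle=True,
                 sort_within_batch=True, seed=0):
    """Split examples into batches, optionally shuffling first and sorting each batch."""
    order = list(examples)
    if shuffle:
        random.Random(seed).shuffle(order)
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if sort_within_batch:
        # Longest example first inside each batch (descending by sort_key).
        batches = [sorted(batch, key=sort_key, reverse=True) for batch in batches]
    return batches


sentences = ["a b c", "a", "a b", "a b c d", "a b c d e", "a b"]
length = lambda s: len(s.split())

fixed = make_batches(sentences, batch_size=3, sort_key=length, shuffle=False)
print(fixed)
# [['a b c', 'a b', 'a'], ['a b c d e', 'a b c d', 'a b']]

shuffled = make_batches(sentences, batch_size=3, sort_key=length, shuffle=True)
print(shuffled)
```

With sort=False and sort_within_batch=True, the data order still varies between epochs, but every batch arrives with its longest sequence first.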
Task: build a translation-style dataset
inputs: [english, chinese] target: [(english, en_len, chinese, cn_len), (...)]
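Steps:
Tokenize the text into two-dimensional lists (one list of tokens per sentence)
Build a vocabulary for each language
Replace the English and Chinese words with their vocabulary indices
Build the batches
  Build groups of batch indices from the number of English sentences and the batch size
  Use those index groups to assemble the batch data, and return a list with the length of each sentence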
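Before the full script, the steps above can be traced on a toy example in plain Python. The three sentence pairs are hypothetical and already tokenized, and the helper names are illustrative only; the point is to show what the index groups and the padded batches look like.

```python
from collections import Counter

UNK_IDX, PAD_IDX = 0, 1


def build_vocab(sentences):
    """Map each word to an index, reserving 0 for UNK and 1 for PAD."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {"UNK": UNK_IDX, "PAD": PAD_IDX}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab)
    return vocab


def encode(sentences, vocab):
    """Replace every word with its index; unknown words become UNK_IDX."""
    return [[vocab.get(word, UNK_IDX) for word in sentence] for sentence in sentences]


def pad_batch(seqs):
    """Pad sequences to the longest one and return them with their true lengths."""
    lengths = [len(seq) for seq in seqs]
    max_len = max(lengths)
    return [seq + [PAD_IDX] * (max_len - len(seq)) for seq in seqs], lengths


# Step 1: tokenized sentence pairs (toy data).
en = [["BOS", "hello", "EOS"], ["BOS", "good", "morning", "EOS"], ["BOS", "thanks", "EOS"]]
cn = [["BOS", "你好", "EOS"], ["BOS", "早上", "好", "EOS"], ["BOS", "谢谢", "EOS"]]

# Steps 2 and 3: one vocabulary per language, then words -> indices.
en_vocab, cn_vocab = build_vocab(en), build_vocab(cn)
en_ids, cn_ids = encode(en, en_vocab), encode(cn, cn_vocab)

# Step 4: index groups first, then padded batches with sentence lengths.
batch_size = 2
index_groups = [list(range(i, min(i + batch_size, len(en_ids))))
                for i in range(0, len(en_ids), batch_size)]
batches = []
for group in index_groups:
    x, x_len = pad_batch([en_ids[i] for i in group])
    y, y_len = pad_batch([cn_ids[i] for i in group])
    batches.append((x, x_len, y, y_len))

print(index_groups)   # [[0, 1], [2]]
print(batches[0][1])  # [3, 4]
```

Each batch is a tuple (english, en_len, chinese, cn_len), matching the target layout described for this task.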
import torch
import numpy as np
import nltk
import jieba
from collections import Counter

UNK_IDX = 0
PAD_IDX = 1
batch_size = 64
train_file = 'data/translate_train.txt'
dev_file = 'data/translate_dev.txt'


def load_data(in_file):
    """
    Data format: english <tab> chinese
    Read the English-Chinese translation file, add BOS and EOS to the start
    and end of each sentence, and return two lists.
    """
    cn = []
    en = []
    with open(in_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip().split("\t")
            # nltk.word_tokenize needs the NLTK tokenizer data, e.g. nltk.download('punkt')
            en.append(['BOS'] + nltk.word_tokenize(line[0].lower()) + ['EOS'])
            # cn.append(['BOS'] + [c for c in line[1]] + ['EOS'])
            cn.append(['BOS'] + jieba.lcut(line[1]) + ['EOS'])
    return en, cn


def build_dict(sentences, max_words=50000):
    """Build the vocabulary; indices 0 and 1 are reserved for UNK and PAD."""
    word_count = Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s] += 1
    ls = word_count.most_common(max_words)
    total_words = len(ls) + 2
    word_dict = {w[0]: index for index, w in enumerate(ls, 2)}
    word_dict['UNK'] = UNK_IDX
    word_dict['PAD'] = PAD_IDX
    return word_dict, total_words


def encode(en_sentences, cn_sentences, en_dict, cn_dict, sort_by_len=True):
    """Encode the sequences: turn each sentence into a list of indices."""
    # Convert each word to its vocabulary index; unknown words map to UNK_IDX
    out_en_sentences = [[en_dict.get(w, UNK_IDX) for w in sent] for sent in en_sentences]
    out_cn_sentences = [[cn_dict.get(w, UNK_IDX) for w in sent] for sent in cn_sentences]

    def len_argsort(seq):
        return sorted(range(len(seq)), key=lambda x: len(seq[x]))

    # Sort both languages by English sentence length, keeping the pairs aligned
    if sort_by_len:
        sorted_index = len_argsort(out_en_sentences)
        out_en_sentences = [out_en_sentences[i] for i in sorted_index]
        out_cn_sentences = [out_cn_sentences[i] for i in sorted_index]
    return out_en_sentences, out_cn_sentences


def get_minibatches(n, minibatch_size, shuffle=False):
    # Start index of each batch: [0, minibatch_size, 2 * minibatch_size, ...]
    idx_list = np.arange(0, n, minibatch_size)
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches = []
    for idx in idx_list:
        minibatches.append(np.arange(idx, min(idx + minibatch_size, n)))
    return minibatches


def prepare_data(seqs, padding_idx):
    """Pad a batch to its longest sequence and return the true lengths."""
    lengths = [len(seq) for seq in seqs]
    n_samples = len(seqs)
    max_len = np.max(lengths)
    x = np.full((n_samples, max_len), padding_idx).astype('int32')
    x_lengths = np.array(lengths).astype("int32")
    for idx, seq in enumerate(seqs):
        x[idx, :lengths[idx]] = seq
    return x, x_lengths  # x_mask


def gen_examples(en_sentences, cn_sentences, batch_size):
    minibatches = get_minibatches(len(en_sentences), batch_size)
    all_ex = []
    for minibatch in minibatches:
        mb_en_sentences = [en_sentences[t] for t in minibatch]
        mb_cn_sentences = [cn_sentences[t] for t in minibatch]
        mb_x, mb_x_len = prepare_data(mb_en_sentences, PAD_IDX)
        mb_y, mb_y_len = prepare_data(mb_cn_sentences, PAD_IDX)
        all_ex.append((mb_x, mb_x_len, mb_y, mb_y_len))
    return all_ex


train_en, train_cn = load_data(train_file)
dev_en, dev_cn = load_data(dev_file)

en_dict, en_total_words = build_dict(train_en)
cn_dict, cn_total_words = build_dict(train_cn)
inv_en_dict = {v: k for k, v in en_dict.items()}
inv_cn_dict = {v: k for k, v in cn_dict.items()}

train_en, train_cn = encode(train_en, train_cn, en_dict, cn_dict)
dev_en, dev_cn = encode(dev_en, dev_cn, en_dict, cn_dict)

print(" ".join([inv_cn_dict[i] for i in train_cn[100]]))
print(" ".join([inv_en_dict[i] for i in train_en[100]]))

train_data = gen_examples(train_en, train_cn, batch_size)
dev_data = gen_examples(dev_en, dev_cn, batch_size)
print(len(train_data))
print(train_data[0])