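【Title】How to extract specific substrings into new rows, using regex?
【Posted】2020-03-07 09:08:57
【Question】I have a DataFrame that contains the complete chat between users and customer agents. I want to extract only the messages sent by the user and create a new row for each one, keeping the same ticket ID: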
import re
import pandas as pd

ticket_id = pd.DataFrame(["1", "2"]).rename(columns={0: "Ticket-ID"})
full_chat = pd.DataFrame([
    "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM",
    "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM",
]).rename(columns={0: "Full-Chat"})
merge_chat = pd.merge(ticket_id, full_chat, left_index=True, right_index=True, how="outer")

def _split_row(text):
    cleaned_text = text.lower()
    # Capture the text between "user " and the next HH:MM timestamp
    lines = re.findall(r"\w*user (.*?) *\d\d:\d\d*", cleaned_text)
    for line in lines:
        print(line.split())

# _split_row only prints and returns None, so this does not build new rows
print(merge_chat["Full-Chat"].apply(_split_row))
I would like the result to look like this:
Ticket-ID  Full-Chat
1          foo foo foo
1          foo foo
1          foo
2          bar bar bar
2          bar bar
【Answer 1】If I understand correctly, first replace each chat with the list of user messages found by the regex:
merge_chat["Full-Chat"] = merge_chat["Full-Chat"].apply(lambda i: re.findall(r"\w*user (.*?) *\d\d:\d\d*", i.lower()))
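To check what the pattern returns before applying it to the whole column, it can be run on a single chat string. This is a minimal sketch; the names `USER_MSG` and `sample` are illustrative and do not appear in the original post.

```python
import re

# Illustrative names (not from the original post)
USER_MSG = re.compile(r"\w*user (.*?) *\d\d:\d\d*")
sample = "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM"

# The lazy group (.*?) stops at the first HH:MM timestamp after "user ",
# so text belonging to agent messages is never captured.
messages = USER_MSG.findall(sample.lower())
print(messages)
```

Lowercasing first is what lets the literal `user` in the pattern match `User`; passing `re.IGNORECASE` instead would be another option if the original casing of the messages needs to be kept.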
As of pandas 0.25.0,
merge_chat.explode(column="Full-Chat")
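will give you the result: each element of a list becomes its own row, and the Ticket-ID value is repeated for every row.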
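Put together, the two steps might look like the sketch below. It assumes the same sample data as the question, built directly as one DataFrame for brevity; the names `pattern` and `result` are illustrative.

```python
import re
import pandas as pd

# Same sample data as the question, built in one step for brevity
merge_chat = pd.DataFrame({
    "Ticket-ID": ["1", "2"],
    "Full-Chat": [
        "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, "
        "Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM",
        "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM",
    ],
})

pattern = r"\w*user (.*?) *\d\d:\d\d*"  # illustrative name

# Step 1: replace each chat string with the list of user messages
merge_chat["Full-Chat"] = merge_chat["Full-Chat"].apply(
    lambda chat: re.findall(pattern, chat.lower())
)

# Step 2: one row per list element; explode repeats the original index
# labels (0, 0, 0, 1, 1), so reset_index gives a clean 0..n-1 index
result = merge_chat.explode("Full-Chat").reset_index(drop=True)
print(result)
```

Note that `explode` returns a new DataFrame rather than modifying `merge_chat` in place, so the result has to be assigned.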
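In versions before 0.25.0, spread the lists into columns, stack them into rows, and then tidy up the index:
# One column per message, indexed by ticket; stack() drops the missing cells
df = pd.DataFrame(merge_chat["Full-Chat"].tolist(), index=merge_chat["Ticket-ID"]).stack()
# Drop the per-ticket message counter, then turn Ticket-ID back into a column
df = df.reset_index(level=1, drop=True).reset_index()
df.rename(columns={0: "Full-Chat"}, inplace=True)
df

  Ticket-ID  Full-Chat
0         1  foo foo foo
1         1  foo foo
2         1  foo
3         2  bar bar bar
4         2  bar bar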
【Answer 2】I tested this and it works:
import re
import pandas as pd

ticket_id = pd.DataFrame(["1", "2"]).rename(columns={0: "Ticket-ID"})
full_chat = pd.DataFrame([
    "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM",
    "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM",
]).rename(columns={0: "Full-Chat"})
merge_chat = pd.merge(ticket_id, full_chat, left_index=True, right_index=True, how="outer")

def split_row(text, ticket_id):
    # Build one row per user message, each tagged with the ticket ID
    cleaned_text = text.lower()
    lines = re.findall(r"\w*user (.*?) *\d\d:\d\d*", cleaned_text)
    return pd.DataFrame({"Ticket-ID": [ticket_id] * len(lines), "Full-Chat": lines})

# The original used DataFrame.append, which was removed in pandas 2.0;
# pd.concat over the per-ticket frames gives the same result
Output_df = pd.concat(
    [split_row(row["Full-Chat"], row["Ticket-ID"]) for index, row in merge_chat.iterrows()]
)
Output_df = Output_df[["Ticket-ID", "Full-Chat"]].reset_index(drop=True)
Output_df.head()
Output:
  Ticket-ID  Full-Chat
0         1  foo foo foo
1         1  foo foo
2         1  foo
3         2  bar bar bar
4         2  bar bar
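For comparison, the row-by-row loop can also be avoided with the pandas string accessor. The sketch below uses `Series.str.extractall`, which is not used in either answer above and is offered only as an alternative; it assumes the same sample data, and the names `pattern` and `extracted` are illustrative.

```python
import pandas as pd

# Same sample data as the question, built in one step for brevity
merge_chat = pd.DataFrame({
    "Ticket-ID": ["1", "2"],
    "Full-Chat": [
        "User foo foo foo 12:12 PM, Agent bar bar bar 12:12 PM, User foo foo 12:13 PM, "
        "Agent bar bar 12:13 PM, User foo 12:14 PM, Agent bar 12:14 PM",
        "User bar bar bar 12:12 PM, Agent foo foo foo 12:12 PM, User bar bar 12:13 PM",
    ],
})

pattern = r"\w*user (.*?) *\d\d:\d\d*"  # illustrative name

extracted = (
    merge_chat.set_index("Ticket-ID")["Full-Chat"]
    .str.lower()
    .str.extractall(pattern)            # one row per match, indexed by (Ticket-ID, match)
    .droplevel("match")                 # drop the per-ticket match counter
    .rename(columns={0: "Full-Chat"})   # the single capture group is column 0
    .reset_index()                      # turn Ticket-ID back into a column
)
print(extracted)
```

Because `extractall` already yields one row per match, no list column or `explode` step is needed with this approach.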