Transforming BioASQ Factoid Questions into SQuAD Format: A Preprocessing Approach

BioASQ is a large-scale question-answering dataset widely used for training and evaluating machine learning models for biomedical question answering. Converting BioASQ to the popular SQuAD (Stanford Question Answering Dataset) format can be challenging due to differences in the structure and representation of the two datasets. This article provides a Python script to preprocess the BioASQ factoid question dataset into SQuAD format.
Overview of the Script
The Python script reads BioASQ data from a JSON file, fetches abstracts and titles from PubMed using PMIDs (PubMed IDs) to construct the “context” field, and structures the data into the SQuAD format.
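Before diving into the code, it helps to see the shapes involved. The sketch below shows a minimal, hypothetical BioASQ factoid entry (the field names match what the script reads; the PMID, id, and offsets are made-up illustrative values) alongside the kind of SQuAD-style record the script produces for it.

# Minimal, hypothetical BioASQ factoid entry (illustrative values only)
bioasq_example = {
    "questions": [{
        "id": "abc123",
        "type": "factoid",
        "body": "Which gene is mutated in cystic fibrosis?",
        "documents": ["http://www.ncbi.nlm.nih.gov/pubmed/12345678"],
        "exact_answer": [["CFTR"]]
    }]
}

# Corresponding SQuAD-style record the script emits
# (the context is built from the fetched PubMed title and abstract)
squad_example = {
    "version": "v1.0",
    "data": [{
        "title": "BioASQ Dataset",
        "paragraphs": [{
            "context": "<article title>. <abstract text>",
            "qas": [{
                "id": "abc123_12345678_0",
                "question": "Which gene is mutated in cystic fibrosis?",
                "answers": [{"text": "CFTR", "answer_start": 42}],  # offset is illustrative
                "is_impossible": False
            }]
        }]
    }]
}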
Detailed Breakdown of the Code
1. Imports and Dependencies:
The script begins by importing the necessary libraries:
import json
import requests
import xml.etree.ElementTree as ET
import time
import os
- json is used for reading and writing JSON files.
- requests is used for making HTTP requests to the PubMed API.
- xml.etree.ElementTree is used for parsing XML responses.
- time is used to add delays between API requests to avoid overwhelming the server.
- os is used for file path manipulations.
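Of these, only requests is a third-party package; the other modules ship with the Python standard library. A minimal, optional guard you could place at the top of the script to fail early when it is missing:

# Fail early with a readable message if the only third-party dependency is missing
try:
    import requests  # noqa: F401
except ImportError:
    raise SystemExit("The 'requests' package is required; install it with pip before running this script.")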
2. Fetching PubMed Abstract and Title:
The function fetch_pubmed_abstract_and_title(pmid) is responsible for fetching the abstract and title of a given PubMed article using its PMID. It constructs the URL for the PubMed API, sends a request, and parses the XML response to extract the title and abstract.
def fetch_pubmed_abstract_and_title(pmid):
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml"
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes
        try:
            root = ET.fromstring(response.content)
            abstract_text = root.find(".//AbstractText")
            article_title = root.find(".//ArticleTitle")
            if abstract_text is not None and article_title is not None:
                context = f"{article_title.text}. {abstract_text.text}"
                return context
            else:
                print(f"No abstract or title found for PMID {pmid}")
                return ""
        except ET.ParseError as e:
            print(f"XML parsing error for PMID {pmid}: {str(e)}")
            print(f"Response content: {response.content[:500]}...")  # Print first 500 characters
            return ""
    except requests.RequestException as e:
        print(f"Request error for PMID {pmid}: {str(e)}")
        return ""
- The function constructs the PubMed API URL using the provided pmid.
- It sends an HTTP GET request to the PubMed API and checks for errors.
- If the request is successful, it parses the XML response to extract the abstract and title.
- If both the abstract and title are found, it concatenates them to form the context and returns it.
- If any error occurs (request or parsing), it prints an error message and returns an empty string.
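To sanity-check the fetcher in isolation, you can call it with a single PMID before running the full conversion. A minimal sketch (the PMID below is just a placeholder; substitute one from your own BioASQ file):

# Quick check of the fetcher with a single, placeholder PMID
sample_pmid = "12345678"  # replace with a real PMID from your dataset
context = fetch_pubmed_abstract_and_title(sample_pmid)
if context:
    print(f"Fetched {len(context)} characters of context")
    print(context[:200])  # preview the first 200 characters
else:
    print("No context returned; check the PMID or the error messages above")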
3. Preprocessing BioASQ to SQuAD:
The preprocess_bioasq_to_squad(input_json) function reads the input JSON, processes each question, fetches the corresponding abstracts, and converts the data into SQuAD format. It includes logic to handle factoid-type questions, extract PMIDs, and structure the data appropriately.
def preprocess_bioasq_to_squad(input_json):
    # Parse the input JSON
    data = json.loads(input_json)

    # Initialize the SQuAD format structure
    squad_format = {
        "version": "v1.0",
        "data": [{
            "title": "BioASQ Dataset",
            "paragraphs": []
        }]
    }

    # Process each question in the dataset
    for question in data['questions']:
        # Only process factoid type questions
        if question['type'] != 'factoid':
            continue

        # Process each document URL
        for doc_url in question['documents']:
            # Extract PMID from the URL
            pmid = doc_url.split('/')[-1]

            # Fetch the abstract and title from PubMed
            context = fetch_pubmed_abstract_and_title(pmid)
            if not context:
                print(f"Skipping PMID {pmid} due to empty abstract or title")
                continue  # Skip if no abstract or title found

            # Process answers
            exact_answers = question['exact_answer']
            if not isinstance(exact_answers, list):
                exact_answers = [exact_answers]

            # Flatten nested lists
            flat_answers = []
            for answer in exact_answers:
                if isinstance(answer, list):
                    flat_answers.extend(answer)
                else:
                    flat_answers.append(answer)

            # Create a separate QA pair for each answer
            for i, answer in enumerate(flat_answers):
                paragraph = {
                    "context": context,
                    "qas": [{
                        "id": f"{question['id']}_{pmid}_{i}",
                        "question": question['body'],
                        "answers": [],
                        "is_impossible": True
                    }]
                }

                # Find the start position of the answer in the context
                answer_start = context.lower().find(answer.lower())
                if answer_start != -1:
                    paragraph['qas'][0]['answers'].append({
                        "text": answer,
                        "answer_start": answer_start
                    })
                    paragraph['qas'][0]['is_impossible'] = False

                # Add the paragraph to the main dataset
                squad_format['data'][0]['paragraphs'].append(paragraph)

            # Add a small delay to avoid overwhelming the PubMed server
            time.sleep(0.5)

    return json.dumps(squad_format, indent=2)
- The function reads the BioASQ JSON data.
- It initializes a dictionary to hold the SQuAD format data.
- For each question in the dataset, it checks whether the question is of type factoid.
- For each document URL associated with the question, it extracts the PMID and fetches the abstract and title from PubMed.
- It processes the exact answers and flattens any nested lists of answers.
- For each answer, it creates a QA pair, finds the answer’s start position in the context, and updates the paragraph structure.
- It adds a small delay between requests to avoid overwhelming the PubMed server.
- The function returns the processed data in SQuAD format as a JSON string.
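Because SQuAD-style training relies on answer_start pointing at the exact answer span, it is worth verifying the offsets after conversion. A small sketch, assuming squad_json holds the string returned by preprocess_bioasq_to_squad:

# Verify that each recorded answer_start actually points at the answer text
converted = json.loads(squad_json)
for paragraph in converted["data"][0]["paragraphs"]:
    context = paragraph["context"]
    for qa in paragraph["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            span = context[start:start + len(ans["text"])]
            # Compare case-insensitively, since the script locates answers with lower()
            if span.lower() != ans["text"].lower():
                print(f"Span mismatch for QA {qa['id']}: expected {ans['text']!r}, got {span!r}")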
4. Reading Input and Writing Output:
The script reads the BioASQ input JSON file, processes it using the preprocess_bioasq_to_squad function, and writes the converted data to a new JSON file.
# Read the input JSON from the file
input_file_path = '/content/test_process.json'
with open(input_file_path, 'r') as file:
    input_json = file.read()

# Preprocess the data
squad_json = preprocess_bioasq_to_squad(input_json)

# Determine the output file name
input_file_name = os.path.basename(input_file_path)
output_file_name = os.path.splitext(input_file_name)[0] + '_preprocessed.json'
output_file_path = os.path.join(os.path.dirname(input_file_path), output_file_name)

# Write the preprocessed data to the new file
with open(output_file_path, 'w') as file:
    file.write(squad_json)

print(f"Preprocessing complete. Output saved to '{output_file_path}'")
- The script reads the input JSON file containing the BioASQ data.
- It calls the preprocess_bioasq_to_squad function to convert the data into SQuAD format.
- It determines the output file name based on the input file name.
- It writes the converted data to a new JSON file.
- It prints a message indicating the location of the output file.
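As a quick sanity check after the file is written, you can reload the output and count how many paragraphs ended up with a located answer span (a small sketch, reusing output_file_path from above):

# Reload the output and report simple coverage statistics
with open(output_file_path, 'r') as f:
    converted = json.load(f)

paragraphs = converted["data"][0]["paragraphs"]
answerable = sum(1 for p in paragraphs if not p["qas"][0]["is_impossible"])
print(f"{len(paragraphs)} paragraphs total, {answerable} with a located answer span")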
Customization
To customize this script for your needs, you can:
- Modify the URL and request parameters in the fetch_pubmed_abstract_and_title function if you need to fetch additional data or use a different API.
- Adjust the way questions and answers are processed in the preprocess_bioasq_to_squad function to handle different types of questions or additional answer formats.
- Change the file paths and input/output handling to suit your environment or workflow.
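As one example of the first customization point, the efetch URL can carry extra query parameters, such as an NCBI API key. The sketch below is a hypothetical helper for building that URL; the key value is a placeholder, and you should consult the E-utilities documentation for the parameters your account supports.

# Hypothetical helper: build the efetch URL, optionally appending an NCBI API key
NCBI_API_KEY = "your-api-key-here"  # placeholder value

def build_efetch_url(pmid, api_key=None):
    """Return the efetch URL for a PMID, optionally including an API key."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=xml"
    if api_key:
        url += f"&api_key={api_key}"
    return url

# Example: url = build_efetch_url("12345678", api_key=NCBI_API_KEY)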
Conclusion
This article provides a detailed walk-through of a Python script that transforms the BioASQ factoid dataset into SQuAD format.
Resources:
Google Colab Link: https://colab.research.google.com/drive/1ehhfsvRkAS_SKZbPbKloV3ndEXWkmVI8?usp=sharing