Here are the steps implemented in the previous article:
Get Microsoft Academic Graph on Azure storage
Set up Azure Data Lake Analytics for Microsoft Academic Graph
Compute author h-index using Azure Data Lake Analytics (U-SQL)
In today's article we will add an Azure Search service to implement full-text search on the MAG data.
Define functions to extract MAG data
1. Add a new job in ADLA
a. Copy the code from samples/CreateFunctions.usql
The code can be found in the previous article as well as on GitHub.
Generate text documents for academic papers
The goal of this step is to submit an ADLA job that generates text files containing academic data; these files will later be used to populate the Azure Search service.
Before creating the job, it is necessary to create a Blob storage account where the output files will go.
Here are the Storage Account settings:
Subscription: Your-subscription
Resource group: Your-resource-group
Location: e.g. (Europe) West Europe
Storage account name: academicoutput (this value will be used in a future job)
Deployment model: Resource manager
Account kind: StorageV2 (general purpose v2)
Replication: Read-access geo-redundant storage (RA-GRS)
Performance: Standard
Access tier (default): Hot
Secure transfer required: Enabled
Allow access from: All networks
Hierarchical namespace: Disabled
Blob soft delete: Disabled
Next, it is necessary to create an output container. From the newly created resource, go to Blobs and click +Container. In this experiment the output container is named "academic-output-container".
Finally, it is necessary to add this blob storage as a data source.
Go to the ADLA account, click on Data Sources, and then Add Data Source.
In the dialog provide the following information:
Storage type: Azure Storage
Selection method: Select Account
Azure storage: the newly created storage account (academicoutput in this case)
Important: if you don't add the storage account as a data source, an exception will be thrown on every execution.
At this step, the following data sources should be present in the ADLA account:
- Blob storage with MAG data
- ADLS (Azure Data Lake Storage) account
- Blob storage for output data
At this point everything is ready to add a new job to the ADLA service.
Add a new job, replacing the following placeholders:
<MagAzureStorageAccount> = the Blob storage account containing the MAG data (same as in the previous article)
<MagContainer> = mag-yyyy-mm-dd (same as in the previous section)
<OutputAzureStorageAccount> = the name of the newly created Blob storage account (academicoutput in this experiment)
<OutputContainer> = the newly created container in the output storage account (academic-output-container in this experiment)
N.B. It is recommended to adjust the AUs value before launching the job.
Here is the execution summary.
AUs: 32, input: 61.8 GB, output: 38 GB, estimated cost: EUR 9.14, efficiency: 67%, preparing: 49s, running: 13m 25s, duration: 14m 14s.
Create Azure Search service
1. From the Azure Portal, create a resource -> Azure Search
Important: create a new resource group for the service with the same name as the service. In this experiment we named the new resource 'academic-search'.
Important: to ensure the best performance, use the same location as the Azure storage account containing the Microsoft Academic Graph data.
In this experiment we chose the Free tier; note that only ONE free-tier service is available per subscription.
2. Once the new service has been created, navigate to the Overview section of the service and copy the URL.
3. Navigate to the Keys section of the service and copy the primary admin key.
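All of the requests in the following sections hit the same endpoint pattern (service URL + resource + api-version) with the same two headers. As a minimal sketch, a small Python helper can compose these requests; the service name and admin key below are placeholders, not values from this article:

```python
import json
import urllib.request

# Placeholders: fill in with the URL from the Overview section and the
# primary admin key from the Keys section of your own service.
SERVICE_URL = "https://academic-search.search.windows.net"
API_KEY = "<primary-admin-key>"
API_VERSION = "2019-05-06"

def build_request(resource, body=None, method="POST"):
    """Compose an Azure Search REST request for a given resource path."""
    url = f"{SERVICE_URL}/{resource}?api-version={API_VERSION}"
    headers = {"api-key": API_KEY, "Content-Type": "application/json"}
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

Sending the returned request with `urllib.request.urlopen` (or pasting the same URL, headers, and body into Postman) is equivalent; the sections below show each body in Postman form.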
Configure initial Postman request and create data source
In Postman (download the application from https://www.getpostman.com/downloads/) provide the following information:
url: url-obtained-from-previous-section/datasources?api-version=2019-05-06
N.B. Available API versions can be found in the "Search Explorer" menu of Azure Search.
method: post
Headers
api-key: primary-admin-key from previous section
Content-type: application/json
Body
{
    "name" : "azure-search-data",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "<AzureStorageAccountConnectionString>" },
    "container" : { "name" : "<MagContainer>", "query" : "azure-search-data" }
}
The connection string should point to the output blob storage account (academicoutput in this experiment).
<MagContainer> should be the container in the output blob storage (academic-output-container in this experiment).
You should receive a "201 created" response.
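If you prefer to build the body programmatically, here is a sketch of the same payload in Python. The account key is a placeholder; the connection-string format is the standard Azure Storage one:

```python
# Placeholder key; the rest of the connection string follows the standard
# Azure Storage format, pointing at the output account from this experiment.
connection_string = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=academicoutput;"
    "AccountKey=<storage-account-key>;"
    "EndpointSuffix=core.windows.net"
)

# Same body as the Postman request above, with the output container filled in.
datasource = {
    "name": "azure-search-data",
    "type": "azureblob",
    "credentials": {"connectionString": connection_string},
    "container": {"name": "academic-output-container", "query": "azure-search-data"},
}
```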
Create index
Here are the request details to create the index:
url: url-obtained-from-previous-section/indexes?api-version=2019-05-06
method: post
Headers
api-key: primary-admin-key from previous section
Content-type: application/json
Body
{
    "name": "mag-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},
        {"name": "rank", "type": "Edm.Int32", "filterable": true, "searchable": false, "facetable": false, "sortable": true},
        {"name": "year", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
        {"name": "journal", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
        {"name": "conference", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
        {"name": "authors", "type": "Collection(Edm.String)", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
        {"name": "volume", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
        {"name": "issue", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
        {"name": "first_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
        {"name": "last_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
        {"name": "title", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
        {"name": "doi", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false}
    ]
}
Create indexers
Here are the request details to create an indexer:
url: url-obtained-from-previous-section/indexers?api-version=2019-05-06
method: post
Headers
api-key: primary-admin-key from previous section
Content-type: application/json
Body
{
    "name" : "mag-indexer-1",
    "dataSourceName" : "azure-search-data",
    "targetIndexName" : "mag-index",
    "schedule" : { "interval" : "PT5M" },
    "parameters" : {
        "configuration" : {
            "parsingMode" : "delimitedText",
            "delimitedTextHeaders" : "id,rank,year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
            "delimitedTextDelimiter": " ",
            "firstLineContainsHeaders": false,
            "indexedFileNameExtensions": ".0"
        }
    }
}
This creates one indexer for files with the .0 extension. It is recommended to create six indexers, each targeting a specific subset of the text documents generated earlier.
Thus, repeat the procedure five more times, changing the indexedFileNameExtensions value (.0, .1, .2, .3, .4, .5) and the indexer name each time.
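Since the six indexer bodies differ only in name and extension, they can be generated in a loop. A sketch in Python (indexer names follow the mag-indexer-N pattern used above):

```python
# Column headers from the index definition in this article.
HEADERS = "id,rank,year,journal,conference,authors,volume,issue,first_page,last_page,title,doi"

def make_indexer(i):
    """Build the indexer body for file-extension bucket .i (0-based)."""
    return {
        "name": f"mag-indexer-{i + 1}",
        "dataSourceName": "azure-search-data",
        "targetIndexName": "mag-index",
        "schedule": {"interval": "PT5M"},
        "parameters": {
            "configuration": {
                "parsingMode": "delimitedText",
                "delimitedTextHeaders": HEADERS,
                "delimitedTextDelimiter": " ",  # delimiter as given in this article
                "firstLineContainsHeaders": False,
                "indexedFileNameExtensions": f".{i}",
            }
        },
    }

# One body per extension bucket, .0 through .5; POST each to /indexers.
indexers = [make_indexer(i) for i in range(6)]
```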
Important: on the Free tier only 3 indexers are available; attempting to create a fourth returns the following error:
"Indexer quota of 3 has been exceeded for this service. You currently have 3 indexers. You must either delete unused indexers first, or upgrade the service for higher limits."
Scale up the service
This step scales up the service's search units (SUs) to ensure that all indexers can run concurrently. To do this, go to the Scale section of the service and change the number of partitions and replicas.
Important: this is not available in the Free tier; create a Standard search service for scalability and greater performance. The indexing operation can take a long time to complete, likely between 16 and 24 hours.
N.B. To see an indexer's status, run the following request from Postman:
url: url-obtained-from-previous-section/indexers/[indexer name]/status?api-version=2019-05-06
method: get
Headers
api-key: primary-admin-key from previous section
Content-type: application/json
Body
None
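A small helper for this status check can also be sketched in Python; the service URL and indexer name below are placeholders, and the "lastResult" field name follows the 2019-05-06 REST API response:

```python
def status_url(service_url, indexer_name, api_version="2019-05-06"):
    """GET this URL (with the api-key header) to fetch an indexer's status."""
    return f"{service_url}/indexers/{indexer_name}/status?api-version={api_version}"

def summarize_status(status_json):
    """Return (overall service-side status, last-run status) from the response body."""
    last = status_json.get("lastResult") or {}
    return status_json.get("status"), last.get("status")
```

Polling this endpoint for each of the six indexers is a convenient way to watch the long-running indexing operation described above.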
Reference parsing with search explorer
Here is the information that should be provided to perform reference parsing with the Azure Search REST API:
url: url-obtained-from-previous-section/indexes/mag-index/docs/search?api-version=2019-05-06
method: post
Headers
api-key: primary-admin-key from previous section
Content-type: application/json
Body
{
    "highlight": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
    "highlightPreTag": "<q>",
    "highlightPostTag": "</q>",
    "search": "Lloyd, K., Wright, S., Suchet-Pearson, S., Burarrwanga, L., Hodge, & P. (2012). Weaving lives together: collaborative fieldwork in North East Arnhem Land, Australia. Annales de Géographie, 121(5), 513–524.",
    "searchFields": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
    "select": "id,rank,year,title",
    "top": 2
}
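The same query body can be built for any reference string; a sketch in Python:

```python
# Fields used both for matching and for hit highlighting in the request above.
SEARCH_FIELDS = "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi"

def reference_query(reference, top=2):
    """Build the search body for parsing one free-form reference string."""
    return {
        "highlight": SEARCH_FIELDS,
        "highlightPreTag": "<q>",
        "highlightPostTag": "</q>",
        "search": reference,
        "searchFields": SEARCH_FIELDS,
        "select": "id,rank,year,title",
        "top": top,
    }
```

POSTing the returned body to /indexes/mag-index/docs/search returns the top candidate papers with matched fragments wrapped in the highlight tags.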
Hope this was helpful.