Microsoft Academic Graph: practical experience, step 2

Alibek Jakupov

Sep 2, 2019 · 4 min read

Updated: Nov 19, 2021

Here are the steps implemented in the previous article:

Get Microsoft Academic Graph on Azure storage
Set up Azure Data Lake Analytics for Microsoft Academic Graph
Compute author h-index using Azure Data Lake Analytics (U-SQL)

In today's article we are going to add an Azure Search service to implement full-text search on the MAG data.

Define functions to extract MAG data

1. Add a new job to ADLA (Azure Data Lake Analytics)

a. Copy the code from samples/CreateFunctions.usql

The code can be found in the previous article as well as on GitHub.

Generate text documents for academic papers

The goal of this step is to submit an ADLA job to generate text files containing academic data that will be used to create an Azure Search service.

Before creating the job, create a Blob storage account to receive the output files.

Here are the storage account settings:

Subscription: Your-subscription
Resource group: Your-resource-group
Location: e.g. (Europe) West Europe
Storage account name: academicoutput (this value will be used in a later job)
Deployment model: Resource manager
Account kind: StorageV2 (general purpose v2)
Replication: Read-access geo-redundant storage (RA-GRS)
Performance: Standard
Access tier (default): Hot
Secure transfer required: Enabled
Allow access from: All networks
Hierarchical namespace: Disabled
Blob soft delete: Disabled

Next, create an output container. From the newly created resource, go to Blobs and click + Container. The output container in this experiment is named “academic-output-container”.

Finally, add this Blob storage account as a data source.

Go to the ADLA account, click Data sources, and then Add Data Source.

In the dialog menu provide the following information:

Storage type: Azure Storage
Selection method: Select Account
Azure storage: the newly created storage account (academicoutput in this case).

Important: if you don’t add the Azure Storage account as a data source, an exception will be thrown on every execution.

At this point, the following data sources should be present in the ADLA account:

Blob with MAG data
ADLS (Azure Data Lake Storage) account
Blob for output data

Everything is now ready to add a new job to the ADLA service.

Add a new job, replacing the placeholders in the script with the following values:

<MagAzureStorageAccount> = Blob storage Account containing mag data (same as in the previous article)

<MagContainer> = mag-yyyy-mm-dd (same as in the previous section)

<OutputAzureStorageAccount> = name of the newly created Blob storage account (academicoutput in this experiment)

<OutputContainer> = newly created container in the output storage (academic-output-container in this experiment)
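If you prefer not to edit the script by hand, the substitution can be scripted. Here is a minimal sketch in Python, assuming the job script is available as plain text. The three DECLARE lines and their variable names are hypothetical stand-ins for the real script, and only the two output values named in this article are filled in.

```python
# The four placeholder tokens listed above.
KNOWN_PLACEHOLDERS = (
    "<MagAzureStorageAccount>",
    "<MagContainer>",
    "<OutputAzureStorageAccount>",
    "<OutputContainer>",
)

# Values used in this experiment. The MAG account and container are
# deliberately left out because they depend on your own subscription.
VALUES = {
    "<OutputAzureStorageAccount>": "academicoutput",
    "<OutputContainer>": "academic-output-container",
}

# Hypothetical fragment standing in for the real job script.
TEMPLATE = '''
DECLARE @magContainer string = "<MagContainer>";
DECLARE @outAccount string = "<OutputAzureStorageAccount>";
DECLARE @outContainer string = "<OutputContainer>";
'''


def fill_placeholders(script_text, values):
    """Replace each placeholder token with its configured value."""
    for token, value in values.items():
        script_text = script_text.replace(token, value)
    return script_text


def unresolved(script_text):
    """Return the known placeholders that are still present in the text."""
    return [token for token in KNOWN_PLACEHOLDERS if token in script_text]


filled = fill_placeholders(TEMPLATE, VALUES)
print(filled)
print("Still to fill in:", unresolved(filled))
```

Checking for unresolved placeholders before submitting avoids paying for a job that fails on the first unreadable path.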

N.B. It is recommended to adjust the number of AUs (Analytics Units) before launching the job.

Here is the execution summary.

AUs: 32, input: 61.8 GB, output: 38 GB, estimated cost: EUR 9.14, efficiency: 67%, preparing: 49s, running: 13m 25s, duration: 14m 14s.

Create Azure Search service

1. From Azure Portal, create a resource -> Azure Search

Important: Create a new resource group for the service with the same name as the service. In this experiment we called the newly created resource ‘academic-search’.

Important: to ensure the best performance, use the same location as the Azure storage account containing the Microsoft Academic Graph data.

In this experiment we chose the Free tier. Note that only one Free-tier service is available per subscription.

2. Once the new service has been created, navigate to the overview section of the service and get the URL

3. Navigate to the keys section of the service and get the primary admin key

Configure initial Postman request and create data source

In Postman (to download the application follow the link: https://www.getpostman.com/downloads/) provide the following information:

url: url-obtained-from-previous-section/datasources?api-version=2019-05-06

N.B. API versions can be found in the “Search Explorer” menu of Azure Search

method: post

Headers

api-key: primary-admin-key from previous section

Content-type: application/json

Body

{
   "name" : "azure-search-data",
   "type" : "azureblob",
   "credentials" : { "connectionString" : "<AzureStorageAccountConnectionString>" },
   "container" : { "name" : "<MagContainer>", "query" : "azure-search-data" }
}

The connection string should point to the output Blob storage account (academicoutput in this experiment).

Here <MagContainer> is the container in the output Blob storage account (academic-output-container in this experiment).

You should receive a "201 created" response.
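The same request can be prepared outside Postman. The sketch below uses only the Python standard library and builds the request without sending it. The service URL assumes the service is named academic-search as in this experiment, and the key and connection string are placeholders you have to replace.

```python
import json
import urllib.request

API_VERSION = "2019-05-06"


def build_datasource_request(service_url, admin_key, connection_string, container):
    """Build (but do not send) the POST request that creates the data source."""
    body = {
        "name": "azure-search-data",
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container, "query": "azure-search-data"},
    }
    url = f"{service_url.rstrip('/')}/datasources?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"api-key": admin_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values: substitute your own URL, key and connection string.
req = build_datasource_request(
    "https://academic-search.search.windows.net",
    "<primary-admin-key>",
    "<AzureStorageAccountConnectionString>",
    "academic-output-container",
)
print(req.get_method(), req.full_url)

# To actually send it (needs a real key and network access):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes it easy to inspect the URL and body before spending a call on the service.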

Create index

Here are the request details to create the index:

url: url-obtained-from-previous-section/indexes?api-version=2019-05-06

method: post

Headers

api-key: primary-admin-key from previous section

Content-type: application/json

Body

{
   "name": "mag-index",
   "fields": [
       {"name": "id", "type": "Edm.String", "key": true, "filterable": false, "searchable": false, "sortable": false, "facetable": false},
       {"name": "rank", "type": "Edm.Int32", "filterable": true, "searchable": false, "facetable": false, "sortable": true},
       {"name": "year", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
       {"name": "journal", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
       {"name": "conference", "type": "Edm.String", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
       {"name": "authors", "type": "Collection(Edm.String)", "filterable": true, "searchable": true, "facetable": false, "sortable": false},
       {"name": "volume", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
       {"name": "issue", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
       {"name": "first_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
       {"name": "last_page", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
       {"name": "title", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false},
       {"name": "doi", "type": "Edm.String", "filterable": false, "searchable": true, "facetable": false, "sortable": false}
   ]
}
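Since most fields differ only in a few flags, the index body can also be generated from a compact table. This is just a sketch: the tuple layout and the build_index helper are my own, while the field names, types and flags are copied from the body above.

```python
import json

# (name, type, filterable, searchable, sortable) for every non-key field.
FIELD_SPECS = [
    ("rank", "Edm.Int32", True, False, True),
    ("year", "Edm.String", True, True, False),
    ("journal", "Edm.String", True, True, False),
    ("conference", "Edm.String", True, True, False),
    ("authors", "Collection(Edm.String)", True, True, False),
    ("volume", "Edm.String", False, True, False),
    ("issue", "Edm.String", False, True, False),
    ("first_page", "Edm.String", False, True, False),
    ("last_page", "Edm.String", False, True, False),
    ("title", "Edm.String", False, True, False),
    ("doi", "Edm.String", False, True, False),
]


def build_index(name="mag-index"):
    """Return the index definition as a dict ready to be serialized to JSON."""
    fields = [{
        "name": "id", "type": "Edm.String", "key": True,
        "filterable": False, "searchable": False,
        "sortable": False, "facetable": False,
    }]
    for field_name, field_type, filterable, searchable, sortable in FIELD_SPECS:
        fields.append({
            "name": field_name, "type": field_type,
            "filterable": filterable, "searchable": searchable,
            "sortable": sortable, "facetable": False,
        })
    return {"name": name, "fields": fields}


index = build_index()
print(json.dumps(index, indent=3))
```

The field order matches the column order used later by the indexers, which makes a mismatch between the two easy to spot.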

Create indexers

Here are the request details to create an indexer:

url: url-obtained-from-previous-section/indexers?api-version=2019-05-06

method: post

Headers

api-key: primary-admin-key from previous section

Content-type: application/json

Body

{
   "name" : "mag-indexer-1",
   "dataSourceName" : "azure-search-data",
   "targetIndexName" : "mag-index",
   "schedule" : {
       "interval" : "PT5M"
   },
   "parameters" : {
       "configuration" : {
           "parsingMode" : "delimitedText",
           "delimitedTextHeaders" : "id,rank,year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
           "delimitedTextDelimiter": "\t",
           "firstLineContainsHeaders": false,
           "indexedFileNameExtensions": ".0"
       }
   }
}

This creates one indexer for files with the .0 extension. It is recommended to create six indexers, each targeting a specific subset of the text documents generated earlier.

To do so, repeat the request five more times, changing the indexer name and setting indexedFileNameExtensions to each of the remaining values (.1, .2, .3, .4, .5).

Important: on the Free tier only 3 indexers are available. Creating a fourth one returns the following error:

"Indexer quota of 3 has been exceeded for this service. You currently have 3 indexers. You must either delete unused indexers first, or upgrade the service for higher limits"
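Rather than editing the body six times by hand, the six variants can be generated in a loop. The sketch below only builds the bodies. The naming scheme mag-indexer-1 to mag-indexer-6 extends the name used above, and the tab delimiter is an assumption that the generated files are tab-separated.

```python
import json

HEADERS = "id,rank,year,journal,conference,authors,volume,issue,first_page,last_page,title,doi"


def build_indexer(n):
    """Return the body of the indexer that handles files ending in .<n>."""
    return {
        "name": f"mag-indexer-{n + 1}",
        "dataSourceName": "azure-search-data",
        "targetIndexName": "mag-index",
        "schedule": {"interval": "PT5M"},
        "parameters": {
            "configuration": {
                "parsingMode": "delimitedText",
                "delimitedTextHeaders": HEADERS,
                # Assumption: the output files are tab-separated.
                "delimitedTextDelimiter": "\t",
                "firstLineContainsHeaders": False,
                "indexedFileNameExtensions": f".{n}",
            }
        },
    }


# One indexer per file extension, .0 to .5.
indexers = [build_indexer(n) for n in range(6)]
for body in indexers:
    ext = body["parameters"]["configuration"]["indexedFileNameExtensions"]
    print(body["name"], "->", ext)

# Each body can be serialized with json.dumps(body) and posted to /indexers.
```

On the Free tier you would slice this list to the first three entries to stay within the quota.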

Scale up the service

This step scales up the service's search units (SU) so that all indexers can run concurrently. To do this, go to the Scale section of the service and change the number of partitions and replicas.

Important: scaling is not available on the Free tier. Create a Standard search service for scalability and greater performance. The indexing operation can take a long time to complete, likely between 16 and 24 hours.
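To get a feel for the possible settings, here is a small sketch that lists partition and replica combinations. Search units are the product of partitions and replicas. The allowed partition counts, the replica limit and the 36 SU ceiling are what I recall for the Standard tier, and the target of one search unit per indexer is an assumption, so check the current service limits before relying on these numbers.

```python
# Assumed Standard-tier limits; verify against the current Azure documentation.
ALLOWED_PARTITIONS = (1, 2, 3, 4, 6, 12)
MAX_REPLICAS = 12
MAX_SEARCH_UNITS = 36


def scale_options(min_units):
    """List (partitions, replicas) pairs providing at least min_units SUs."""
    return [
        (partitions, replicas)
        for partitions in ALLOWED_PARTITIONS
        for replicas in range(1, MAX_REPLICAS + 1)
        if min_units <= partitions * replicas <= MAX_SEARCH_UNITS
    ]


# Six indexers, assuming one search unit per concurrently running indexer.
options = scale_options(6)
smallest = min(p * r for p, r in options)
for partitions, replicas in options:
    if partitions * replicas == smallest:
        print(f"{partitions} partition(s) x {replicas} replica(s) = {smallest} SU")
```

Several combinations reach the same number of search units, so the choice between partitions and replicas depends on whether you need more storage or more query capacity.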

N.B. To check the indexer status, send the following request from Postman:

url: url-obtained-from-previous-section/indexers/[indexer name]/status?api-version=2019-05-06

method: get

Headers

api-key: primary-admin-key from previous section

Content-type: application/json

Body

None
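The status response is a JSON document, and with several indexers running for hours a one-line summary per indexer is handy. A minimal sketch follows, assuming the response carries a top-level status plus a lastResult object with itemsProcessed and itemsFailed counters. The sample values below are made up for illustration, not real output.

```python
def summarize_status(indexer_name, status):
    """Condense an indexer status response (parsed JSON) into one line."""
    # lastResult can be missing or null if the indexer has never run.
    last = status.get("lastResult") or {}
    return (
        f"{indexer_name}: {status.get('status', 'unknown')}, "
        f"last run {last.get('status', 'n/a')}, "
        f"{last.get('itemsProcessed', 0)} processed, "
        f"{last.get('itemsFailed', 0)} failed"
    )


# Made-up sample shaped like a status response.
sample = {
    "status": "running",
    "lastResult": {"status": "inProgress", "itemsProcessed": 1200, "itemsFailed": 3},
}
print(summarize_status("mag-indexer-1", sample))
print(summarize_status("mag-indexer-2", {"status": "running", "lastResult": None}))
```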

Reference parsing with search explorer

Here is the information needed to perform reference parsing with the Azure Search REST API:

url: url-obtained-from-previous-section/indexes/mag-index/docs/search?api-version=2019-05-06

method: post

Headers

api-key: primary-admin-key from previous section

Content-type: application/json

Body

{
    "highlight": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
    "highlightPreTag": "<q>",
    "highlightPostTag": "</q>",
    "search": "Lloyd, K., Wright, S., Suchet-Pearson, S., Burarrwanga, L., Hodge, & P. (2012). Weaving lives together: collaborative fieldwork in North East Arnhem Land, Australia. Annales de Géographie, 121(5), 513–524.",
    "searchFields": "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi",
    "select": "id,rank,year,title",
    "top": 2
}
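To parse many references, the body can be built from a function and the best hit picked from the response. The sketch below assumes the response lists hits under "value", each with "@search.score" and "@search.highlights". The helper names and the fake response are mine and only illustrate the shape.

```python
SEARCH_FIELDS = "year,journal,conference,authors,volume,issue,first_page,last_page,title,doi"


def build_reference_query(reference, top=2):
    """Build the search body for one free-text reference string."""
    return {
        "highlight": SEARCH_FIELDS,
        "highlightPreTag": "<q>",
        "highlightPostTag": "</q>",
        "search": reference,
        "searchFields": SEARCH_FIELDS,
        "select": "id,rank,year,title",
        "top": top,
    }


def best_match(response):
    """Return the hit with the highest score, or None if there are no hits."""
    hits = response.get("value", [])
    if not hits:
        return None
    return max(hits, key=lambda hit: hit.get("@search.score", 0.0))


def matched_fields(hit):
    """Names of the index fields that produced a highlight for this hit."""
    highlights = hit.get("@search.highlights", {})
    return sorted(name for name in highlights if "@" not in name)


# Fake response for illustration only.
fake_response = {"value": [
    {"@search.score": 3.1, "id": "1", "title": "A",
     "@search.highlights": {"year": ["<q>2012</q>"]}},
    {"@search.score": 9.7, "id": "2", "title": "B",
     "@search.highlights": {"title": ["<q>Weaving</q> lives"], "year": ["<q>2012</q>"]}},
]}
query = build_reference_query("Weaving lives together (2012)")
hit = best_match(fake_response)
print(hit["id"], matched_fields(hit))
```

The highlighted fields show which parts of the reference (year, title, authors and so on) were recognized, which is the actual "parsing" result.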

Hope this was helpful.
