How to perform unauthenticated Instagram web scraping in response to recent private API changes?

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
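
As a rough sketch of what "persisting" could look like: the helpers below pull the csrftoken value out of the Set-Cookie headers of the first response and build the headers to reuse on later requests. The names getCookieValue and buildPersistentHeaders are hypothetical, and I'm assuming the headers arrive as an array of strings (as Node exposes them under headers['set-cookie']).

```javascript
// Hypothetical helper: find one cookie value in an array of Set-Cookie header strings.
function getCookieValue(setCookieHeaders, name) {
    for (const header of setCookieHeaders || []) {
        // The name=value pair is the first ';'-separated segment of each header.
        const pair = header.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq !== -1 && pair.slice(0, eq).trim() === name) {
            return pair.slice(eq + 1).trim();
        }
    }
    return null; // cookie not present
}

// Hypothetical helper: headers to resend on every follow-up request.
function buildPersistentHeaders(userAgent, csrfToken) {
    return {
        'User-Agent': userAgent,
        'Cookie': `csrftoken=${csrfToken}`
    };
}
```

With superagent you would presumably call getCookieValue(initResponse.headers['set-cookie'], 'csrftoken') and pass the result of buildPersistentHeaders to .set() on each later request.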

X-Instagram-GIS header generation

As your answer shows, you must generate the X-Instagram-GIS header from two values: the rhx_gis value, which is found in the response to your initial request, and the query variables of your next request. These are joined with a colon and MD5-hashed, as shown in your function above:

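const crypto = require('crypto');

// MD5 of "<rhx_gis>:<query variables>", returned as a hex string.
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest('hex');
};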

So, in order to call an Instagram query endpoint, you need to generate the x-instagram-gis header.

To generate this header, you need to calculate the MD5 hash of the following string: "{rhx_gis}:{path}". The rhx_gis value is stored in the source code of the Instagram page, in the window._sharedData global JS variable.
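
A minimal sketch of reading rhx_gis out of the page source, in case it helps. It assumes the page embeds the data as a "window._sharedData = {...};" assignment that ends right before a closing script tag; that layout is an assumption on my part, and extractRhxGis is a hypothetical name.

```javascript
// Hypothetical helper: read rhx_gis from the HTML of an Instagram page.
// Assumes the markup contains: window._sharedData = {...};</script>
function extractRhxGis(html) {
    const match = /window\._sharedData\s*=\s*(\{.*?\});\s*<\/script>/s.exec(html);
    if (!match) {
        return null; // assignment not found, the page layout may differ
    }
    try {
        const sharedData = JSON.parse(match[1]);
        return typeof sharedData.rhx_gis === 'string' ? sharedData.rhx_gis : null;
    } catch (err) {
        return null; // the captured text was not valid JSON
    }
}
```

It returns null instead of throwing, so the caller can decide what to do when the value is missing.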

Example:
If you send a GET request for user info like this: https://www.instagram.com/{username}/?__a=1
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")
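
The example above could be sketched in Node like this. signProfilePath is a hypothetical name, and the rhxGis argument is whatever you read from window._sharedData.

```javascript
const crypto = require('crypto');

// Hypothetical helper: x-instagram-gis value for https://www.instagram.com/{username}/?__a=1
function signProfilePath(rhxGis, username) {
    const path = `/${username}/`; // the path only, without the ?__a=1 query string
    return crypto.createHash('md5').update(`${rhxGis}:${path}`, 'utf8').digest('hex');
}
```

The result is a 32-character hex string that goes into the x-instagram-gis request header.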

This is tested and works 100%, so feel free to ask if something goes wrong.


Uhm... I don't have Node installed on my machine, so I can't verify this for sure, but it looks to me like you are missing a crucial part of the query string parameters: the after field.

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 4,
    after: "YOUR_END_CURSOR"
});

Your MD5 hash depends on those queryVariables, so without the after field it doesn't match the expected one. Try that; I expect it to work.
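
To illustrate why the missing field matters, here is a small sketch showing that the signature changes with the serialized variables. The rhx_gis value is a placeholder, and sign is a hypothetical two-argument helper that hashes "{rhx_gis}:{variables}".

```javascript
const crypto = require('crypto');

// Hypothetical helper: MD5 of "<rhx_gis>:<serialized variables>" as hex.
const sign = (rhxGis, queryVariables) =>
    crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest('hex');

const rhxGis = 'PLACEHOLDER_RHX_GIS'; // placeholder, not a real value

const withoutCursor = JSON.stringify({ id: '123456789', first: 4 });
const withCursor = JSON.stringify({ id: '123456789', first: 4, after: 'YOUR_END_CURSOR' });

const signatureWithout = sign(rhxGis, withoutCursor);
const signatureWith = sign(rhxGis, withCursor);

// Different variable strings give different signatures.
console.log(signatureWithout === signatureWith); // false
```

For the same reason, the string you hash should be the exact string you send as the variables parameter.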

EDIT:

Reading your code carefully, it unfortunately doesn't make much sense. I infer that you are trying to fetch the full stream of pictures from a user's feed.

Then, what you need to call is not the Instagram home page, as you are doing now (superagent.get('https://www.instagram.com/')), but rather the user's stream (superagent.get('https://www.instagram.com/your_user')).

Beware: you need to hardcode the very same user agent you're going to use below (and it doesn't look like you are...).

Then, you need to extract the query ID (it's not hardcoded, it changes every few hours, sometimes minutes; hardcoding it is foolish – however, for this POC, you can keep it hardcoded), and the end_cursor. For the end cursor I'd go for something like this:

const endCursor = (RegExp('end_cursor":"([^"]*)"', 'g')).exec(initResponse.text)[1];
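
Note that the one-liner above throws a TypeError when the pattern isn't found (exec returns null), for example when the page contains "end_cursor":null. A slightly more defensive sketch, with the hypothetical name extractEndCursor:

```javascript
// Hypothetical helper: first end_cursor string in the page text, or null if there is none.
function extractEndCursor(text) {
    const match = /"end_cursor":"([^"]*)"/.exec(text);
    // No match (e.g. "end_cursor":null) or an empty string both mean "no cursor".
    return match && match[1] !== '' ? match[1] : null;
}
```

You would then pass initResponse.text to it and skip the second request when it returns null.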

Now you have everything you need to make the second request:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9,
    after: endCursor
});

const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'Accept': '*/*',
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close',
        'X-Instagram-GIS': signature,
        'Cookie': `rur=${rurCookie};csrftoken=${csrfTokenCookie};mid=${midCookie};ig_pr=1`
    }).send();