Q about "Native Code Performance and Memory: The Elephant in the CPU" video


  • The explanation for why the speedup was low for the 48 MB array
    (only 30 MB of cache per block of cores) seems unlikely to me, since the data is accessed in a linear way, not in a random-access kind of way.

    Can somebody please correct me or tell me that I'm right. :)

    Thursday, July 4, 2013 3:30 PM

All replies

  • I think it was measured on the second pass.

    On the first pass, the two packages in a NUMA configuration can read faster, assuming they each read from their local memory. But it still won't be 18x.
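
    For illustration, a rough sketch of node-local allocation on Windows, using the Win32 NUMA APIs (just the allocation; splitting the 48MB across nodes and affinitizing one thread per node is left out):

    #include <windows.h>
    #include <iostream>

    int main()
    {
        ULONG highest_node = 0;
        GetNumaHighestNodeNumber(&highest_node); // nodes are numbered 0..highest_node

        // Allocate a buffer whose physical pages prefer a given NUMA node,
        // so a thread running on that node reads from local memory.
        const SIZE_T bytes = 24 * 1024 * 1024; // e.g. half of the 48MB per package
        void* local = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                         MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                         0 /* preferred node */);
        std::cout << "nodes: " << (highest_node + 1) << ", buffer: " << local << "\n";
        VirtualFree(local, 0, MEM_RELEASE);
    }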

    • Edited by dimkaz Thursday, July 4, 2013 4:19 PM added numa comment
    Thursday, July 4, 2013 4:18 PM
  • I still think that even on the 1st pass, prefetchers are smart enough to figure out linear access...

    It COULD be a bandwidth problem, but then again, believe it or not, I don't have that machine to test it. :P

    Thursday, July 4, 2013 4:32 PM
  • Hi, thanks for the question.

    I should have made this clearer in my talk. The key here is that the function accessing 48MB is called often. It is constantly striding through 48MB of data. In that regard, for the most part, the best performance is achieved when keeping the data as close to the execution units as possible.

    Strategic prefetching could work in this case. Instead of having 48MB of data resident in the on-chip caches at all times, you can just ensure the data is in the caches prior to being read/written. In certain circumstances prefetching is beneficial -- but you should ask yourself if that is really the case here.
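
    For illustration, here is a minimal sketch of what that kind of software prefetching can look like on a linear scan, using the _mm_prefetch intrinsic. The prefetch distance below is a placeholder -- the right value depends on the machine and needs measuring:

    #include <xmmintrin.h> // _mm_prefetch, _MM_HINT_T0
    #include <cstdint>
    #include <cstddef>

    std::int64_t sum_with_prefetch(const int* data, std::size_t n)
    {
        // Placeholder distance: roughly 16 cache lines ahead of the read position.
        static const std::size_t kPrefetchDistance = 256;
        std::int64_t sum = 0;
        for (std::size_t i = 0; i < n; ++i)
        {
            // Hint the CPU to pull in a line we'll need shortly. A tuned
            // version would issue this once per cache line, not per element.
            if (i + kPrefetchDistance < n)
                _mm_prefetch(reinterpret_cast<const char*>(data + i + kPrefetchDistance), _MM_HINT_T0);
            sum += data[i];
        }
        return sum;
    }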

    Consider option A, where we use up a ton of L3 cache to ensure the data is resident at all times. Consider option B, where we use strategic prefetching to use less cache overall, but we are constantly streaming data in from memory. It's possible to hit a case where options A and B perform the same in terms of wall-clock time.

    BUT, are you sure you want to be hogging the memory bus for this? There may be other threads on the system that are impeded by all that prefetch traffic. Similarly, there could be a power concern (read: battery life on phones & tablets). I haven't measured this -- just conjecture at this point.

    I hope this clears it up!

    Eric

    Wednesday, July 17, 2013 9:07 PM
  • Hi,

    thank you for your answer. :)

    Just to be clear, I was talking about HW prefetching; I don't know if you are talking about forcing reads to make the data hot...

    Other than that... doh...

    I thought you were talking about just one pass of the function, which is why I was confused. :)

    BTW, regarding linear access for the first read of a cold array - do you know if cache performance in cases where the data is bigger than the L2 cache is limited by L2<->RAM bandwidth, or by something else?

    In other words, my feeling is that the prefetcher is smart enough to *try* to get you the data when you do linear access, but it just can't fetch it fast enough because it is limited by bandwidth, not because it doesn't know what to prefetch next.

    If somebody is bored and wants to run some cache-stress code, here is my hacked-together program :)

    // Cacheless in Seattle.cpp : Defines the entry point for the console application.
    //
    #include <vector>
    #include <cstdint>
    #include <cstdlib>   // rand, exit
    #include <iostream>
    #include <iomanip>
    #include <numeric>
    #include <cassert>
    #include <chrono>

    static const size_t L2_CACHE_SIZE = 512 * 1024; // bytes

    using namespace std;

    // Walk a cache-sized scratch buffer so every timed run starts cold.
    void trash_cache()
    {
        vector<uint8_t> vec(L2_CACHE_SIZE);
        if (accumulate(begin(vec), end(vec), 0) == 42) // force usage so the loop isn't optimized away
            exit(1234);
    }

    int64_t sum_array(const vector<int>& vec)
    {
        auto sum = accumulate(begin(vec), end(vec), int64_t(0)); // 64-bit accumulator to avoid overflow
        if (sum * rand() * rand() * rand() * rand() == 42) // data-dependent branch so the sum isn't optimized away
            exit(12345);
        return sum;
    }

    template <typename Func, typename Param>
    chrono::milliseconds measure_func_run_time(Func func, const Param& param)
    {
        trash_cache();
        auto t_start = chrono::high_resolution_clock::now();
        int64_t val = 0; // must be initialized before +=
        static const int kNRepeatFunctionCall = 2500;
        for (int i = 0; i < kNRepeatFunctionCall; ++i)
            val += func(param);
        auto t_end = chrono::high_resolution_clock::now();
        if (val == 42) // force usage of the accumulated value
            exit(123);
        return chrono::duration_cast<chrono::milliseconds>(t_end - t_start);
    }

    int main()
    {
        // Working-set sizes as multiples of the L2 cache size.
        vector<double> sizes{ 0.125, 0.25, 0.5, 0.75, 1, 1.5, 2, 4, 8, 32 };
        for (const auto size : sizes)
        {
            size_t bytes = static_cast<size_t>(L2_CACHE_SIZE * size);
            assert(bytes > 0);
            vector<int> vec(bytes / sizeof(int)); // element count, not byte count
            auto time = measure_func_run_time(sum_array, vec);
            cout << size << "   t=" << time.count() << "ms " << setw(8)
                 << " \"speed\":" << size / time.count() << endl;
        }
    }
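
    In case anyone wants to try it: build with a C++11 compiler and optimizations on, otherwise the timings are meaningless. Assuming MSVC, something like:

    cl /O2 /EHsc "Cacheless in Seattle.cpp"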


    • Edited by NoSenseEtAl Wednesday, July 24, 2013 8:53 PM edit
    Wednesday, July 24, 2013 8:51 PM